# install necessary packages
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
# suppress some annoying warnings
import warnings
='ignore', category=FutureWarning)
warnings.simplefilter(action
='darkgrid') # https://seaborn.pydata.org/generated/seaborn.set_theme.html
sns.set_theme(style"notebook") # set an overall scale. Notebook is the default. In increasing size: paper, notebook, talk, poster.
sns.set_context('font.sans-serif'] = ['Calibri','Segoe UI','Arial'] # use a nicer font in matplotlib (if available) plt.rcParams[
Visualisation with Seaborn
Python has a rich ecosystem of libraries for data visualisation, each with different strengths. Some popular options include matplotlib for fine control over plots, plotly for interactive visualisations, and bokeh for web-ready dashboards. In this session, we’ll be using seaborn. It’s built on top of matplotlib but offers a simpler, high-level interface and nice looking default styles — it’s therefore a good choice when you who want to quickly create clear and informative plots without needing to tweak every detail.
We are using Australian weather data, taken from Kaggle. This dataset is used to build machine learning models that predict whether it will rain tomorrow, using data about the weather every day from 2007 to 2017. To download the data, click here.
One final note before we get started - This page is a combination of text and python code. We’ve tried to explain clearly what we’re about to do before we do it, but do also note the # comments
within the python code cells themselves that occasionally explain a specific line of code in more detail.
Initial setup
We’re going to import some python packages. Remember that the plt
, np
, sns
aliases are just for convenience - we could omit this completely or use different aliases if we prefer.
sns
?
Seaborn being imported as sns
is an odd convention (they are the initials of the fictional character the package was named for) that will make it easier to read or copy/paste online examples.
As before, we need to import our dataset. We’re importing the csv
file into an initial dataframe called df
to start with.
# import the dataset
= pd.read_csv('data/weatherAUS.csv') df
Data manipulation
Column conversions
Before we start actually generating some visuals, we need to make sure our Date column contains proper datetimes. We’re also going to drop the years with partial data so that our dataset only has full years. Finally we’re going to change the RainTomorrow field to contain a 0 or a 1 rather than yes/no.
# convert date column to datetime
'Date'] = pd.to_datetime(df['Date'])
df[
# create a column that just contains the year
'Year'] = df['Date'].dt.year
df[
# drop the partial years (2007,2017)
= df[~df['Year'].isin([2007,2017])]
df
# convert 'RainTomorrow' to a numeric variable, where 'Yes' = 1 and 'No' = 0.
'RainToday']=df['RainToday'].replace({'Yes': 1, 'No': 0, 'NA':0}).fillna(0).astype(int)
df['RainTomorrow']=df['RainTomorrow'].map({'Yes': 1, 'No': 0,'NA': 0}).fillna(0).astype(int);
df[
# little tip: the semicolon suppresses textual output when we don't want it
Sort order and other helper tables
We need a month order for our “mmm” months - there is probably an official way of doing this…
= ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'] month_order
We also need a sort order for our city names to use as a column order for some of our charts later. We’ll just arrange them alphabetically.
=['Adelaide','Brisbane','Melbourne','Perth','Sydney'] colOrder_top5Cities
To enhance a chart we’re going to build later, we’re going to dynamically calculate some text describing our data range.
# Calculate the date range dynamically; we're going to use this later...
= df['Date'].min().strftime('%Y')
date_min = df['Date'].max().strftime('%Y')
date_max = f"{date_min} - {date_max}"
date_range
print(date_range)
2008 - 2016
Pivoting and grouping
Next, we’re going to create some helper dataframes by filtering, grouping and pivoting the data. These will be used for different types of visuals later. Of course, we could have just created these groupings and pivots inline when we do the actual visualisation, but we’re doing it this way because:
- It’s easier to follow
- It’s tidier (and probably faster) to create these dataframes once as we’re going to be using them multiple times.
# build a month column
'Month'] = df['Date'].dt.strftime('%b') # Add a column that just contains the month in mmm format
df['Month'] = pd.Categorical(df['Month'], categories=month_order, ordered=True) # Make it categorical using our custom order so that it appears in the right order
df[
# we're going to filter to top 5 cities from now on
= df[df['Location'].isin(['Perth','Adelaide','Sydney','Melbourne','Brisbane'])]
df_top5Cities
# a dataframe with the number of rainy days per year and month, and location
= df_top5Cities.groupby(['Location','Year', 'Month'])['RainToday'].sum().reset_index()
df_top5Cities_rainyDays
# finally, we're going to create some grouped and pivoted dataframes. Picture these as PivotTables in Excel.
= df_top5Cities.groupby(['Location', 'Month'])['Rainfall'].mean().reset_index()
df_top5Cities_Rainfall_grouped = df_top5Cities_Rainfall_grouped.pivot(index="Location",columns="Month", values="Rainfall")
df_top5Cities_Rainfall_pivoted = df_top5Cities.groupby(['Location', 'Month','Year'])['RainToday'].sum().reset_index().groupby(['Location','Month'])['RainToday'].mean().reset_index().pivot(index="Location",columns="Month", values="RainToday") df_top5Cities_monthly_rainyDays_pivoted
Let’s use head()
to make sure we understand what each grouped/pivoted DF is for.
2) df_top5Cities_Rainfall_grouped.head(
Location | Month | Rainfall | |
---|---|---|---|
0 | Adelaide | Jan | 0.672199 |
1 | Adelaide | Feb | 0.973604 |
2) df_top5Cities_Rainfall_pivoted.head(
Month | Jan | Feb | Mar | Apr | May | Jun | Jul | Aug | Sep | Oct | Nov | Dec |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Location | ||||||||||||
Adelaide | 0.672199 | 0.973604 | 1.171667 | 1.461165 | 2.293860 | 2.447009 | 2.873606 | 2.399237 | 1.809125 | 0.906273 | 0.730827 | 1.192500 |
Brisbane | 6.415574 | 5.325389 | 4.442276 | 3.165385 | 3.126446 | 2.516318 | 1.000000 | 1.273381 | 1.314498 | 2.419424 | 3.347761 | 4.551613 |
2) df_top5Cities_monthly_rainyDays_pivoted.head(
Month | Jan | Feb | Mar | Apr | May | Jun | Jul | Aug | Sep | Oct | Nov | Dec |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Location | ||||||||||||
Adelaide | 2.555556 | 1.555556 | 3.666667 | 4.000000 | 7.666667 | 8.555556 | 13.444444 | 11.444444 | 7.888889 | 4.222222 | 4.000000 | 4.555556 |
Brisbane | 8.000000 | 7.111111 | 10.000000 | 5.333333 | 5.555556 | 6.222222 | 4.111111 | 3.555556 | 4.111111 | 5.888889 | 6.888889 | 8.777778 |
df[df[...
?
= df[df['Location'].isin(['Perth','Adelaide','Sydney','Melbourne','Brisbane'])] df_top5Cities
- The first (outer)
df[
tells pandas that we want to select a subset of rows based on some condition. - The second (inner)
df[
is going to tell pandas this condition. In this case, we’re usingisin
to return a dataframe that contains a series ofTrue
andFalse
rows corresponding to whether the rows in our original dataframe had theLocation
column as one of our 5 cities. - The final dataframe is then a filtered copy where the inner condition is
True
.
Yes, there are other ways of doing this! For example by using .query()
to specify our conditions.
Doing some actual plotting
The Seaborn home page has a very good introductory tutorial, reference documentation, and a nice collection of examples. You should familiarise yourself with the documentation; it’ll pay off massively if you actually grasp what each function and argument is for, rather than just copy/pasting examples and tweaking them until they work (without really understanding what they’re doing).
A basic one-line line chart
sns.lineplot(=df_top5Cities_Rainfall_pivoted.T #.T gives the transpose (flips rows and columns)
data )
Just a little bit of customisation - a bar chart
https://seaborn.pydata.org/generated/seaborn.barplot.html
= sns.barplot(
our_barplot =df_top5Cities_Rainfall_grouped
data="Month"
,x="Rainfall"
,y="Location" # read "hue" as "series"
,hue="tab10" # https://matplotlib.org/stable/users/explain/colors/colormaps.html
,palette
)
set(title='Average daily rainfall by month and city',ylim=(0,8))
our_barplot.
"upper left", title=None, ncol=4) # https://seaborn.pydata.org/generated/seaborn.move_legend.html sns.move_legend(our_barplot,
sns.barplot
call?
The barplot
function provided by Seaborn doesn’t actually allow setting of a title - it just generates a plot (including its axes) and returns this as a matplotlib Axes
object (recall we mentioned earlier that Seaborn is a layer on top of the matplotlib library). By using the .set(...)
method on our barplot object, we can modify this returned object to give it a title. We also could have used this to customise our axis labels (the defaults are fine here), set axis limits, or things like tick labels.
Heatmaps
The Seaborn heatmap
function will easily let us create a two-dimensional heatmap visual with a specific colour theme and custom number formatting.
# We need to use some matplotlib code to set our output size, add a title, and capitalise our x-axis label
= plt.subplots(figsize=(10, 5)) # matplotlib subplots are a common way of setting a figure layout
f,ax f"Average daily rainfall (/mm) each month ({date_range}) for Australia's top 5 cities", fontsize=16, fontweight="bold", pad=10) # using our previously set date_range variable
ax.set_title(
# Heatmap expects rectangular (pivot-like) data
sns.heatmap(df_top5Cities_Rainfall_pivoted =True # Put numbers inside the cells
,annot=".1f" # Make the numbers have 1 decimal place
,fmt=True # Square vs rectangular cells
,square=False # Get rid of the colourbar legend on the side
,cbar="Blues" # Seems appropriate for rainfall. Colourmaps reference: https://matplotlib.org/stable/users/explain/colors/colormaps.html
,cmap=ax # Tell it to use the matplotlib axes we created earlier
,ax )
Another heatmap with some further tweaks
We can make our heatmap look just a little better by apply some tweaks to the subplots object.
# Again setting up matplotlib subplots so that we can make some changes later
= plt.subplots(figsize=(10, 5))
f,ax
# Heatmap expects rectangular (pivot-like) data
sns.heatmap(df_top5Cities_monthly_rainyDays_pivoted =True # Put numbers inside the cells
,annot=".0f" # Force the number format
,fmt=True # Square vs rectangular cells
,square=False # Get rid of the colourbar legend on the side
,cbar="crest" # Colourmaps reference: https://matplotlib.org/stable/users/explain/colors/colormaps.html
,cmap=ax # Tell it to use the matplotlib axes we created earlier
,ax
)
# We need to use some matplotlib code to set our output size, add a title, and capitalise our x-axis label
='x', labelsize=11, rotation=45) # I think 45-degree month labels look nicer, but this is a matter of taste.
ax.tick_params(axis='y', labelsize=11)
ax.tick_params(axis
# Manually changing our axis labels for more control
"Month",fontweight="bold",fontsize=12)
ax.set_xlabel("City",fontweight="bold",fontsize=12)
ax.set_ylabel(
# Set our title dynamically
f"Mean number of rainy days by month between {date_min} and {date_max} for Australia's top 5 cities", fontsize=16, fontweight="bold", pad=15); ax.set_title(
A fancy multi-chart visual
This chart uses the boxgrid
object to arrange multiple different subcharts. We’re actually generating two sets of different visuals (linegrid
and boxgrid
) in one output. If you’re not sure what the for [...] in [...]
syntax means, don’t worry - this will be covered in a future session.
# Setting up the grid of box plots
# Box plots are a bit of a rabbit hole and are extremely customisable; we're mostly using defaults here
= sns.FacetGrid(df_top5Cities
boxgrid ="Location" # Defining the different facets
,col=5, height=4.5 # Layout and sizing for our facet grid
,col_wrap=colOrder_top5Cities # Using our alphabetical order of city names to arrange our facets
,col_order
)map(sns.boxplot # This is what tells sns what sort of plots we want in our grid
boxgrid."Month" # X
,"MaxTemp" # Y
,=1.5
,linewidth="skyblue"
,color=month_order
,order=0 # Seaborn boxplots by default include markers for outliers, which it calls "fliers". For this chart we'd like to disable these.
,fliersize
)
# Setting up the grid of line plots
# Using the default matplotlib plot here
= sns.FacetGrid(df_top5Cities.groupby(['Location', 'Month'])['Rainfall'].mean().reset_index()
linegrid ="Location"
,col="Location",palette="Set2"
,hue=5, height=4.5
,col_wrap=colOrder_top5Cities
,col_order
)map(plt.plot, "Month", "Rainfall",marker="o")
linegrid.
# Formatting axes
for ax in boxgrid.axes.flat:
='x', labelsize=9, rotation=45)
ax.tick_params(axis='y', labelsize=9)
ax.tick_params(axisfor ax in linegrid.axes.flat:
='x', labelsize=9, rotation=45)
ax.tick_params(axis='y', labelsize=9)
ax.tick_params(axis
="{col_name}",fontweight="bold",fontsize=16)
linegrid.set_titles(col_template"Month","Average rainfall /mm",fontweight="bold",fontsize=10)
linegrid.set_axis_labels(="{col_name}",fontweight="bold",fontsize=16)
boxgrid.set_titles(col_template"Month","Max temp /$\\degree$C",fontweight="bold",fontsize=10)
boxgrid.set_axis_labels(
# Setting overall titles and spacing
f"Mean daily rainfall by month for top 5 cities, {date_range}", fontsize=16, color='black',fontweight='bold')
linegrid.fig.suptitle(f"Max temperature by month for top 5 cities, {date_range}", fontsize=16, color='black',fontweight='bold')
boxgrid.fig.suptitle(=0.85)
linegrid.fig.subplots_adjust(top=0.85); boxgrid.fig.subplots_adjust(top
“Small multiples”-type time-series grid
This is mostly based on the example given at https://seaborn.pydata.org/examples/timeseries_facets.html.
Firstly, for this one we need another dataframe that just contains the Sydney data.
= df[df['Location']=="Sydney"].groupby(['Month', 'Location','Year'], as_index=False)['MaxTemp'].max() df_Sydney
We’re going to use the relplot
function to create a grid of plots with a specific set of variables across its rows and columns. For each cell we’re plotting that year’s data with a different colour, and plotting all the other years in grey in the background.
# plot each year's time series in its own facet
= sns.relplot(data=df_Sydney
g ="Month"
,x="MaxTemp"
,y="Year"
,col="Year"
,hue="line"
,kind="viridis"
,palette=4
,linewidth=5
,zorder=3, height=2, aspect=1.5, legend=False,
,col_wrap
)
# iterate over each subplot to customize further
for year, ax in g.axes_dict.items():
# Add the title as an annotation within the plot
.8
ax.text(.85
,
,year=ax.transAxes
,transform="bold",fontsize=9)
,fontweight
# Plot every year's time series in the background
sns.lineplot(=df_Sydney, x="Month", y="MaxTemp", units="Year",
data=None, color=".7", linewidth=1, ax=ax
estimator
)
# reduce the frequency of the x axis ticks
2])
ax.set_xticks(ax.get_xticks()[::
# tweak the supporting aspects of the plot
"")
g.set_titles(f"Max temperature by month in Sydney, {date_range}", fontsize=16, color='black',fontweight='bold')
g.fig.suptitle("", "Max Temp /$\\degree \\mathrm{C}$");
g.set_axis_labels(; g.tight_layout()