Exploring Data Using Pandas

This is the first of four sessions looking at how to explore data in Python. This session will focus on introducing the Python library, pandas. We will use pandas to import, inspect, summarise, and transform the data, illustrating a typical exploratory data analysis workflow.

We are using Australian weather data, taken from Kaggle. This dataset is used to build machine learning models that predict whether it will rain tomorrow, using daily weather observations from 2007 to 2017. The data can be downloaded from Kaggle.

# install necessary packages
# !uv add skimpy

# import packages
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

from skimpy import skim
# import the dataset
df = pd.read_csv('data/weatherAUS.csv')

Setting the Scene

Before we start to explore any dataset, we need to establish what we are looking to do with the data. This should inform our decisions throughout the exploration, and any analysis that follows.

Questions:

  • What are we trying to achieve?
  • How do our goals impact our analysis?
  • What should we take into consideration before we write any code?
  • What sort of questions might we be interested in with this dataset?

What Our Data Can Tell Us (And What it Can’t)

We also need to consider what the data is and where it came from.

Questions:

  • How was the data collected?
  • What is it missing?
  • What do the variables in our dataset actually mean, and are they a good approximation of the concepts we are interested in?

Exploring the Dataset

First, we should start with dataset-wide operations.

Questions:

  • What do we want to know about a dataset when we first encounter it?
  • How do we get a quick overview of the data that can help us in our next steps?
  • We need to get a “feel” for the data before we can really make any decisions about how to analyse it. How do we get there with a new dataset?

We can start by getting a quick glance at the data. The usual starting point when you have just imported a new dataset is the pandas method pd.DataFrame.head(), which shows the top \(n\) rows of the dataset (by default, the top five rows).

# view the top five rows
df.head()
Date Location MinTemp MaxTemp Rainfall Evaporation Sunshine WindGustDir WindGustSpeed WindDir9am ... Humidity9am Humidity3pm Pressure9am Pressure3pm Cloud9am Cloud3pm Temp9am Temp3pm RainToday RainTomorrow
0 2008-12-01 Albury 13.4 22.9 0.6 NaN NaN W 44.0 W ... 71.0 22.0 1007.7 1007.1 8.0 NaN 16.9 21.8 No No
1 2008-12-02 Albury 7.4 25.1 0.0 NaN NaN WNW 44.0 NNW ... 44.0 25.0 1010.6 1007.8 NaN NaN 17.2 24.3 No No
2 2008-12-03 Albury 12.9 25.7 0.0 NaN NaN WSW 46.0 W ... 38.0 30.0 1007.6 1008.7 NaN 2.0 21.0 23.2 No No
3 2008-12-04 Albury 9.2 28.0 0.0 NaN NaN NE 24.0 SE ... 45.0 16.0 1017.6 1012.8 NaN NaN 18.1 26.5 No No
4 2008-12-05 Albury 17.5 32.3 1.0 NaN NaN W 41.0 ENE ... 82.0 33.0 1010.8 1006.0 7.0 8.0 17.8 29.7 No No

5 rows × 23 columns

You can also look at the bottom rows of the dataset, using pd.DataFrame.tail(). This might be useful if you are dealing with time-series data. Below, we specify that we want to look at the bottom ten rows.

# view the bottom ten rows
df.tail(10)
Date Location MinTemp MaxTemp Rainfall Evaporation Sunshine WindGustDir WindGustSpeed WindDir9am ... Humidity9am Humidity3pm Pressure9am Pressure3pm Cloud9am Cloud3pm Temp9am Temp3pm RainToday RainTomorrow
145450 2017-06-16 Uluru 5.2 24.3 0.0 NaN NaN E 24.0 SE ... 53.0 24.0 1023.8 1020.0 NaN NaN 12.3 23.3 No No
145451 2017-06-17 Uluru 6.4 23.4 0.0 NaN NaN ESE 31.0 S ... 53.0 25.0 1025.8 1023.0 NaN NaN 11.2 23.1 No No
145452 2017-06-18 Uluru 8.0 20.7 0.0 NaN NaN ESE 41.0 SE ... 56.0 32.0 1028.1 1024.3 NaN 7.0 11.6 20.0 No No
145453 2017-06-19 Uluru 7.4 20.6 0.0 NaN NaN E 35.0 ESE ... 63.0 33.0 1027.2 1023.3 NaN NaN 11.0 20.3 No No
145454 2017-06-20 Uluru 3.5 21.8 0.0 NaN NaN E 31.0 ESE ... 59.0 27.0 1024.7 1021.2 NaN NaN 9.4 20.9 No No
145455 2017-06-21 Uluru 2.8 23.4 0.0 NaN NaN E 31.0 SE ... 51.0 24.0 1024.6 1020.3 NaN NaN 10.1 22.4 No No
145456 2017-06-22 Uluru 3.6 25.3 0.0 NaN NaN NNW 22.0 SE ... 56.0 21.0 1023.5 1019.1 NaN NaN 10.9 24.5 No No
145457 2017-06-23 Uluru 5.4 26.9 0.0 NaN NaN N 37.0 SE ... 53.0 24.0 1021.0 1016.8 NaN NaN 12.5 26.1 No No
145458 2017-06-24 Uluru 7.8 27.0 0.0 NaN NaN SE 28.0 SSE ... 51.0 24.0 1019.4 1016.5 3.0 2.0 15.1 26.0 No No
145459 2017-06-25 Uluru 14.9 NaN 0.0 NaN NaN NaN NaN ESE ... 62.0 36.0 1020.2 1017.9 8.0 8.0 15.0 20.9 No NaN

10 rows × 23 columns

A quick glimpse at the data is useful, but we may also want quick descriptions of several aspects of the data, such as the length of the dataset (len(), which can also be used to get the length of various other Python objects), which tells us how many observations we have.

# get the object length
len(df)
145460

Another option is pd.DataFrame.shape, which shows the length (number of rows) and width (number of columns).

# get the object shape (number of rows, number of columns)
df.shape
(145460, 23)

Speaking of columns, if we want a quick list of the column names, we can get this using pd.DataFrame.columns.

# get all column names
df.columns
Index(['Date', 'Location', 'MinTemp', 'MaxTemp', 'Rainfall', 'Evaporation',
       'Sunshine', 'WindGustDir', 'WindGustSpeed', 'WindDir9am', 'WindDir3pm',
       'WindSpeed9am', 'WindSpeed3pm', 'Humidity9am', 'Humidity3pm',
       'Pressure9am', 'Pressure3pm', 'Cloud9am', 'Cloud3pm', 'Temp9am',
       'Temp3pm', 'RainToday', 'RainTomorrow'],
      dtype='object')

A quick and easy way to get some valuable information about the dataset is pd.DataFrame.info(), which includes the total non-null observations and the data type1 of each column.

# get dataframe info (column indices, non-null counts, data types)
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 145460 entries, 0 to 145459
Data columns (total 23 columns):
 #   Column         Non-Null Count   Dtype  
---  ------         --------------   -----  
 0   Date           145460 non-null  object 
 1   Location       145460 non-null  object 
 2   MinTemp        143975 non-null  float64
 3   MaxTemp        144199 non-null  float64
 4   Rainfall       142199 non-null  float64
 5   Evaporation    82670 non-null   float64
 6   Sunshine       75625 non-null   float64
 7   WindGustDir    135134 non-null  object 
 8   WindGustSpeed  135197 non-null  float64
 9   WindDir9am     134894 non-null  object 
 10  WindDir3pm     141232 non-null  object 
 11  WindSpeed9am   143693 non-null  float64
 12  WindSpeed3pm   142398 non-null  float64
 13  Humidity9am    142806 non-null  float64
 14  Humidity3pm    140953 non-null  float64
 15  Pressure9am    130395 non-null  float64
 16  Pressure3pm    130432 non-null  float64
 17  Cloud9am       89572 non-null   float64
 18  Cloud3pm       86102 non-null   float64
 19  Temp9am        143693 non-null  float64
 20  Temp3pm        141851 non-null  float64
 21  RainToday      142199 non-null  object 
 22  RainTomorrow   142193 non-null  object 
dtypes: float64(16), object(7)
memory usage: 25.5+ MB

If we wanted to get a better sense of the null values in each column, we could calculate the percentage of null values by capturing whether each value in each column is null (pd.DataFrame.isnull()), summing the total null values in each column (pd.DataFrame.sum()), and then dividing by the length of the dataframe (len()).

# calculate the percentage of null values in each column
df.isnull().sum()/len(df)
Date             0.000000
Location         0.000000
MinTemp          0.010209
MaxTemp          0.008669
Rainfall         0.022419
Evaporation      0.431665
Sunshine         0.480098
WindGustDir      0.070989
WindGustSpeed    0.070555
WindDir9am       0.072639
WindDir3pm       0.029066
WindSpeed9am     0.012148
WindSpeed3pm     0.021050
Humidity9am      0.018246
Humidity3pm      0.030984
Pressure9am      0.103568
Pressure3pm      0.103314
Cloud9am         0.384216
Cloud3pm         0.408071
Temp9am          0.012148
Temp3pm          0.024811
RainToday        0.022419
RainTomorrow     0.022460
dtype: float64

If we want a quick summary of all the numeric columns in the dataset, we can use pd.DataFrame.describe().

# quick summary of numeric variables
df.describe()
MinTemp MaxTemp Rainfall Evaporation Sunshine WindGustSpeed WindSpeed9am WindSpeed3pm Humidity9am Humidity3pm Pressure9am Pressure3pm Cloud9am Cloud3pm Temp9am Temp3pm
count 143975.000000 144199.000000 142199.000000 82670.000000 75625.000000 135197.000000 143693.000000 142398.000000 142806.000000 140953.000000 130395.00000 130432.000000 89572.000000 86102.000000 143693.000000 141851.00000
mean 12.194034 23.221348 2.360918 5.468232 7.611178 40.035230 14.043426 18.662657 68.880831 51.539116 1017.64994 1015.255889 4.447461 4.509930 16.990631 21.68339
std 6.398495 7.119049 8.478060 4.193704 3.785483 13.607062 8.915375 8.809800 19.029164 20.795902 7.10653 7.037414 2.887159 2.720357 6.488753 6.93665
min -8.500000 -4.800000 0.000000 0.000000 0.000000 6.000000 0.000000 0.000000 0.000000 0.000000 980.50000 977.100000 0.000000 0.000000 -7.200000 -5.40000
25% 7.600000 17.900000 0.000000 2.600000 4.800000 31.000000 7.000000 13.000000 57.000000 37.000000 1012.90000 1010.400000 1.000000 2.000000 12.300000 16.60000
50% 12.000000 22.600000 0.000000 4.800000 8.400000 39.000000 13.000000 19.000000 70.000000 52.000000 1017.60000 1015.200000 5.000000 5.000000 16.700000 21.10000
75% 16.900000 28.200000 0.800000 7.400000 10.600000 48.000000 19.000000 24.000000 83.000000 66.000000 1022.40000 1020.000000 7.000000 7.000000 21.600000 26.40000
max 33.900000 48.100000 371.000000 145.000000 14.500000 135.000000 130.000000 87.000000 100.000000 100.000000 1041.00000 1039.600000 9.000000 9.000000 40.200000 46.70000

However, I prefer to bring in another package, skimpy, that does all of this very quickly and cleanly. We can get a detailed description of the entire dataset using skim().

# a more informative summary function from the skimpy package
skim(df)
skimpy summary

Data Summary: 145460 rows × 23 columns
Data Types: float64 (16), string (7)

number
column         NA     NA %   mean   sd     p0     p25   p50   p75   p100
MinTemp        1485   1.02   12.19  6.398  -8.5   7.6   12    16.9  33.9
MaxTemp        1261   0.87   23.22  7.119  -4.8   17.9  22.6  28.2  48.1
Rainfall       3261   2.24   2.361  8.478  0      0     0     0.8   371
Evaporation    62790  43.17  5.468  4.194  0      2.6   4.8   7.4   145
Sunshine       69835  48.01  7.611  3.785  0      4.8   8.4   10.6  14.5
WindGustSpeed  10263  7.06   40.04  13.61  6      31    39    48    135
WindSpeed9am   1767   1.21   14.04  8.915  0      7     13    19    130
WindSpeed3pm   3062   2.11   18.66  8.81   0      13    19    24    87
Humidity9am    2654   1.82   68.88  19.03  0      57    70    83    100
Humidity3pm    4507   3.10   51.54  20.8   0      37    52    66    100
Pressure9am    15065  10.36  1018   7.107  980.5  1013  1018  1022  1041
Pressure3pm    15028  10.33  1015   7.037  977.1  1010  1015  1020  1040
Cloud9am       55888  38.42  4.447  2.887  0      1     5     7     9
Cloud3pm       59358  40.81  4.51   2.72   0      2     5     7     9
Temp9am        1767   1.21   16.99  6.489  -7.2   12.3  16.7  21.6  40.2
Temp3pm        3609   2.48   21.68  6.937  -5.4   16.6  21.1  26.4  46.7

string
column        NA     NA %   shortest    longest           min         max         chars per row  words per row  total words
Date          0      0.00   2008-12-01  2008-12-01        2007-11-01  2017-06-25  10             1              145460
Location      0      0.00   Sale        MelbourneAirport  Adelaide    Woomera     8.71           1              145460
WindGustDir   10326  7.10   W           WNW               E           WSW         2.19           0.93           135134
WindDir9am    10566  7.26   W           NNW               E           WSW         2.18           0.93           134894
WindDir3pm    4228   2.91   E           WNW               E           WSW         2.21           0.97           141232
RainToday     3261   2.24   No          Yes               No          Yes         2.22           0.98           142199
RainTomorrow  3267   2.25   No          Yes               No          Yes         2.22           0.98           142193

Exploring Variables (Columns) & Observations (Rows)

If we are going to narrow our focus to specific variables or groups of observations, we need to know how to select columns, filter values, and group the data. There are lots of different ways we can slice up the data. We won’t cover all of them here2, but we will try to cover a range that helps illustrate how pandas works and will help you build the intuition for working with data in pandas.

We can select columns in a variety of ways, but the “correct” way to select columns in most circumstances is using selection brackets (the square brackets []), also known as the indexing operator.

# selecting a single column by name
df['Date']

# alternative ways to select columns
# df.loc[:, 'Date']
# df.Date
0         2008-12-01
1         2008-12-02
2         2008-12-03
3         2008-12-04
4         2008-12-05
             ...    
145455    2017-06-21
145456    2017-06-22
145457    2017-06-23
145458    2017-06-24
145459    2017-06-25
Name: Date, Length: 145460, dtype: object

If we want to select multiple columns, we can use double square brackets ([[ ]]). This is the same process as before, but the inner brackets define a list, and the outer brackets are the selection brackets.

# selecting multiple columns (and all rows) by name
df[['Date', 'Location', 'Rainfall']]
# df.loc[:, ['Date', 'Location', 'Rainfall']]
Date Location Rainfall
0 2008-12-01 Albury 0.6
1 2008-12-02 Albury 0.0
2 2008-12-03 Albury 0.0
3 2008-12-04 Albury 0.0
4 2008-12-05 Albury 1.0
... ... ... ...
145455 2017-06-21 Uluru 0.0
145456 2017-06-22 Uluru 0.0
145457 2017-06-23 Uluru 0.0
145458 2017-06-24 Uluru 0.0
145459 2017-06-25 Uluru 0.0

145460 rows × 3 columns

While selection brackets are a quick and easy solution if we want to grab a subset of variables in the dataset, they are really only intended for simple operations involving column selection.

For row selection, we should use pd.DataFrame.iloc[]. The iloc indexer is used for “integer position” selection, which means you can select rows or columns using their position in the dataframe. To select rows 10-15, we can use the following (slices exclude the end position, so 10:16 returns rows 10 through 15):

# slicing by rows
df.iloc[10:16]
Date Location MinTemp MaxTemp Rainfall Evaporation Sunshine WindGustDir WindGustSpeed WindDir9am ... Humidity9am Humidity3pm Pressure9am Pressure3pm Cloud9am Cloud3pm Temp9am Temp3pm RainToday RainTomorrow
10 2008-12-11 Albury 13.4 30.4 0.0 NaN NaN N 30.0 SSE ... 48.0 22.0 1011.8 1008.7 NaN NaN 20.4 28.8 No Yes
11 2008-12-12 Albury 15.9 21.7 2.2 NaN NaN NNE 31.0 NE ... 89.0 91.0 1010.5 1004.2 8.0 8.0 15.9 17.0 Yes Yes
12 2008-12-13 Albury 15.9 18.6 15.6 NaN NaN W 61.0 NNW ... 76.0 93.0 994.3 993.0 8.0 8.0 17.4 15.8 Yes Yes
13 2008-12-14 Albury 12.6 21.0 3.6 NaN NaN SW 44.0 W ... 65.0 43.0 1001.2 1001.8 NaN 7.0 15.8 19.8 Yes No
14 2008-12-15 Albury 8.4 24.6 0.0 NaN NaN NaN NaN S ... 57.0 32.0 1009.7 1008.7 NaN NaN 15.9 23.5 No NaN
15 2008-12-16 Albury 9.8 27.7 NaN NaN NaN WNW 50.0 NaN ... 50.0 28.0 1013.4 1010.3 0.0 NaN 17.3 26.2 NaN No

6 rows × 23 columns

We can do something similar using a column’s integer position, but we have to select all rows (:) first:

# using iloc with columns
df.iloc[:, 20]
0         21.8
1         24.3
2         23.2
3         26.5
4         29.7
          ... 
145455    22.4
145456    24.5
145457    26.1
145458    26.0
145459    20.9
Name: Temp3pm, Length: 145460, dtype: float64

Finally, we can put both together to take a subset of both rows and columns:

# using iloc with rows and columns
df.iloc[10:16, 20]
10    28.8
11    17.0
12    15.8
13    19.8
14    23.5
15    26.2
Name: Temp3pm, dtype: float64

However, selecting by integer position is relatively limited. It is more likely that we will want to subset the data based on the values of certain columns. We can filter rows by condition using pd.DataFrame.loc[]. The loc indexer slices by label, instead of integer position.

For example, we might want to look at a subset of the data based on location.

# select all observations in Perth
df.loc[df['Location'] == 'Perth']
Date Location MinTemp MaxTemp Rainfall Evaporation Sunshine WindGustDir WindGustSpeed WindDir9am ... Humidity9am Humidity3pm Pressure9am Pressure3pm Cloud9am Cloud3pm Temp9am Temp3pm RainToday RainTomorrow
120638 2008-07-01 Perth 2.7 18.8 0.0 0.8 9.1 ENE 20.0 NaN ... 97.0 53.0 1027.6 1024.5 2.0 3.0 8.5 18.1 No No
120639 2008-07-02 Perth 6.4 20.7 0.0 1.8 7.0 NE 22.0 ESE ... 80.0 39.0 1024.1 1019.0 0.0 6.0 11.1 19.7 No No
120640 2008-07-03 Perth 6.5 19.9 0.4 2.2 7.3 NE 31.0 NaN ... 84.0 71.0 1016.8 1015.6 1.0 3.0 12.1 17.7 No Yes
120641 2008-07-04 Perth 9.5 19.2 1.8 1.2 4.7 W 26.0 NNE ... 93.0 73.0 1019.3 1018.4 6.0 6.0 13.2 17.7 Yes Yes
120642 2008-07-05 Perth 9.5 16.4 1.8 1.4 4.9 WSW 44.0 W ... 69.0 57.0 1020.4 1022.1 7.0 5.0 15.9 16.0 Yes Yes
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
123826 2017-06-21 Perth 10.3 19.9 0.2 1.8 7.5 NW 37.0 NNE ... 89.0 60.0 1017.1 1013.8 5.0 6.0 13.0 18.5 No Yes
123827 2017-06-22 Perth 13.0 16.8 61.2 3.6 0.0 SSW 46.0 W ... 90.0 75.0 1005.6 1008.9 7.0 7.0 16.4 15.6 Yes No
123828 2017-06-23 Perth 13.3 18.9 0.4 1.8 6.5 SE 37.0 SE ... 85.0 65.0 1019.2 1019.4 6.0 6.0 15.1 18.0 No No
123829 2017-06-24 Perth 11.5 18.2 0.0 3.8 9.3 SE 30.0 ESE ... 62.0 47.0 1025.9 1023.4 1.0 3.0 14.0 17.6 No No
123830 2017-06-25 Perth 6.3 17.0 0.0 1.6 7.9 E 26.0 SE ... 75.0 49.0 1028.6 1026.0 1.0 3.0 11.5 15.6 No No

3193 rows × 23 columns

We can also filter by multiple values, such as location and rainfall.

df.loc[(df['Rainfall'] == 0) & (df['Location'] == 'Perth')]
Date Location MinTemp MaxTemp Rainfall Evaporation Sunshine WindGustDir WindGustSpeed WindDir9am ... Humidity9am Humidity3pm Pressure9am Pressure3pm Cloud9am Cloud3pm Temp9am Temp3pm RainToday RainTomorrow
120638 2008-07-01 Perth 2.7 18.8 0.0 0.8 9.1 ENE 20.0 NaN ... 97.0 53.0 1027.6 1024.5 2.0 3.0 8.5 18.1 No No
120639 2008-07-02 Perth 6.4 20.7 0.0 1.8 7.0 NE 22.0 ESE ... 80.0 39.0 1024.1 1019.0 0.0 6.0 11.1 19.7 No No
120644 2008-07-07 Perth 0.7 18.3 0.0 0.8 9.3 N 37.0 NE ... 72.0 36.0 1028.9 1024.2 1.0 5.0 8.7 17.9 No No
120645 2008-07-08 Perth 3.2 20.4 0.0 1.4 6.9 NNW 24.0 NE ... 58.0 42.0 1023.9 1021.1 6.0 5.0 10.2 19.3 No Yes
120651 2008-07-14 Perth 7.9 19.7 0.0 0.2 6.5 NE 31.0 NE ... 86.0 41.0 1026.0 1021.9 6.0 5.0 11.7 18.7 No No
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
123823 2017-06-18 Perth 7.5 23.4 0.0 1.8 9.2 NNE 28.0 ENE ... 67.0 41.0 1026.9 1022.9 0.0 0.0 14.2 22.2 No No
123824 2017-06-19 Perth 5.5 23.0 0.0 3.0 9.1 SW 19.0 ENE ... 84.0 55.0 1023.0 1020.3 1.0 2.0 11.5 22.0 No No
123825 2017-06-20 Perth 7.8 22.5 0.0 2.8 9.1 NW 26.0 W ... 98.0 59.0 1019.3 1015.9 1.0 1.0 13.5 21.6 No No
123829 2017-06-24 Perth 11.5 18.2 0.0 3.8 9.3 SE 30.0 ESE ... 62.0 47.0 1025.9 1023.4 1.0 3.0 14.0 17.6 No No
123830 2017-06-25 Perth 6.3 17.0 0.0 1.6 7.9 E 26.0 SE ... 75.0 49.0 1028.6 1026.0 1.0 3.0 11.5 15.6 No No

2293 rows × 23 columns

For any complex subsetting of the data, including multiple conditions, pd.DataFrame.loc[] is the best bet.
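
For instance, we can combine conditions with isin() (a standard pandas method) to match several locations at once:

# select dry days in either Perth or Sydney
df.loc[(df['Rainfall'] == 0) & (df['Location'].isin(['Perth', 'Sydney']))]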

Summarising Data

Now that we know how to select the variables or observations we are interested in, we can start doing some descriptive analysis. The operations we use will depend on the questions we are trying to answer, and the possibilities will be almost endless.

Questions:

  • What “functions” might we need to carry out on our data when we are exploring it?

We know that the weather data includes observations from all over the country, but we might want to check exactly how many different locations there are. We can use pd.DataFrame.nunique() to do this.

# count unique values
df['Location'].nunique()
49

We may also be interested in the locations themselves, which may tell us more about the spatial distribution of our data. In this case, we can use pd.Series.unique().

# get unique values
df['Location'].unique()
array(['Albury', 'BadgerysCreek', 'Cobar', 'CoffsHarbour', 'Moree',
       'Newcastle', 'NorahHead', 'NorfolkIsland', 'Penrith', 'Richmond',
       'Sydney', 'SydneyAirport', 'WaggaWagga', 'Williamtown',
       'Wollongong', 'Canberra', 'Tuggeranong', 'MountGinini', 'Ballarat',
       'Bendigo', 'Sale', 'MelbourneAirport', 'Melbourne', 'Mildura',
       'Nhil', 'Portland', 'Watsonia', 'Dartmoor', 'Brisbane', 'Cairns',
       'GoldCoast', 'Townsville', 'Adelaide', 'MountGambier', 'Nuriootpa',
       'Woomera', 'Albany', 'Witchcliffe', 'PearceRAAF', 'PerthAirport',
       'Perth', 'SalmonGums', 'Walpole', 'Hobart', 'Launceston',
       'AliceSprings', 'Darwin', 'Katherine', 'Uluru'], dtype=object)

Another common operation we might look to do is calculating the mean value (pd.DataFrame.mean()) of a certain variable. What is the average value of sunshine across the entire dataset?

# calculate variable mean
df['Sunshine'].mean()
np.float64(7.611177520661157)

This gives us the mean to many decimal places, and we probably don’t need to know the average sunshine hours to this level of precision. We can chain the round() method to round to two decimal places.

# round mean value
df['Sunshine'].mean().round(2)
np.float64(7.61)

Many operations will return the value with information about the object’s type included. The above values are wrapped in np.float64() because pd.DataFrame.mean() uses numpy to calculate the mean value. However, if you want to strip this information out so you only see the value itself, you can use print().
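
For example, using the same calculation as above:

# print the value without the type wrapper
print(df['Sunshine'].mean().round(2))
7.61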

While we are often interested in the mean value when we talk about averages, we might want to know the median instead (pd.DataFrame.median()).

# calculate other summary statistics
print(df['Sunshine'].median())
8.4

Another common calculation is summing values (pd.DataFrame.sum()). We can use sum() to see the total hours of sunshine in our dataset, and we can use int() to convert this value to an integer (which also means we don’t need to use print()3).

# calculate sum value and return an integer
int(df['Sunshine'].sum())
575595

We can also apply these summary operations to multiple variables, using the same selection logic as before (double square brackets).

print(df[['Sunshine', 'Rainfall']].mean())
Sunshine    7.611178
Rainfall    2.360918
dtype: float64

And we can apply multiple functions, using pd.DataFrame.agg().

df['Sunshine'].agg(['mean', 'median', 'sum']).round(1)
mean           7.6
median         8.4
sum       575595.3
Name: Sunshine, dtype: float64

The next step when exploring specific variables will often be group-level summaries. The average amount of sunshine across the whole dataset has limited utility, but the average hours of sunshine in each location allows us to compare between locations and start to understand how different variables are related to each other. If we want to do a group-level operation, we have to use pd.DataFrame.groupby().

# calculate group means
df.groupby(by='Location')['Sunshine'].mean().round(1)
Location
Adelaide            7.7
Albany              6.7
Albury              NaN
AliceSprings        9.6
BadgerysCreek       NaN
Ballarat            NaN
Bendigo             NaN
Brisbane            8.1
Cairns              7.6
Canberra            7.4
Cobar               8.7
CoffsHarbour        7.4
Dartmoor            6.5
Darwin              8.5
GoldCoast           NaN
Hobart              6.6
Katherine           NaN
Launceston          NaN
Melbourne           6.4
MelbourneAirport    6.4
Mildura             8.5
Moree               8.9
MountGambier        6.5
MountGinini         NaN
Newcastle           NaN
Nhil                NaN
NorahHead           NaN
NorfolkIsland       7.0
Nuriootpa           7.7
PearceRAAF          8.8
Penrith             NaN
Perth               8.8
PerthAirport        8.8
Portland            6.5
Richmond            NaN
Sale                6.7
SalmonGums          NaN
Sydney              7.2
SydneyAirport       7.2
Townsville          8.5
Tuggeranong         NaN
Uluru               NaN
WaggaWagga          8.2
Walpole             NaN
Watsonia            6.4
Williamtown         7.2
Witchcliffe         NaN
Wollongong          NaN
Woomera             9.0
Name: Sunshine, dtype: float64

The groupby(by='Location') call specifies the grouping variable (location), then we select the variable we want to summarise by location (sunshine), and then we specify the operation (mean).
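
If we want to make comparisons between locations easier, we can also sort the results, as a small sketch using the standard sort_values() method:

# sort group means from sunniest to least sunny
df.groupby(by='Location')['Sunshine'].mean().round(1).sort_values(ascending=False)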

There are multiple locations that return NaN (Not a Number). This indicates that numpy was unable to calculate a mean value for those locations. This is likely to be because all sunshine values for those locations are null.

We can check this using pd.DataFrame.count(), which counts the total non-null values (whereas the pd.DataFrame.size attribute counts the total values, including nulls).

# group by location and count non-null sunshine values
df.groupby('Location')['Sunshine'].count()
Location
Adelaide            1769
Albany              2520
Albury                 0
AliceSprings        2520
BadgerysCreek          0
Ballarat               0
Bendigo                0
Brisbane            3144
Cairns              2564
Canberra            1521
Cobar                550
CoffsHarbour        1494
Dartmoor            2566
Darwin              3189
GoldCoast              0
Hobart              3179
Katherine              0
Launceston             0
Melbourne           3192
MelbourneAirport    3008
Mildura             2876
Moree               2055
MountGambier        2597
MountGinini            0
Newcastle              0
Nhil                   0
NorahHead              0
NorfolkIsland       2570
Nuriootpa           2848
PearceRAAF          3004
Penrith                0
Perth               3188
PerthAirport        3004
Portland            2566
Richmond               0
Sale                1818
SalmonGums             0
Sydney              3328
SydneyAirport       2993
Townsville          2617
Tuggeranong            0
Uluru                  0
WaggaWagga          2575
Walpole                0
Watsonia            3008
Williamtown         1355
Witchcliffe            0
Wollongong             0
Woomera             2007
Name: Sunshine, dtype: int64

The results show that all the locations that return NaN in our group mean calculation have zero non-null values.
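
We can pull out just those locations programmatically, as a small sketch built on the counts above:

# list the locations with zero non-null sunshine values
sunshine_counts = df.groupby('Location')['Sunshine'].count()
print(sunshine_counts[sunshine_counts == 0].index.tolist())
['Albury', 'BadgerysCreek', 'Ballarat', 'Bendigo', 'GoldCoast', 'Katherine', 'Launceston', 'MountGinini', 'Newcastle', 'Nhil', 'NorahHead', 'Penrith', 'Richmond', 'SalmonGums', 'Tuggeranong', 'Uluru', 'Walpole', 'Witchcliffe', 'Wollongong']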

Transforming Data

Datasets are rarely perfectly clean and tidy. We often need to transform the data before we can get the most out of it.

Questions:

  • What sort of transformations would help us get the most out of the analysis of the Australian weather data?

The first step with any analysis is often converting columns to the correct types. With a longitudinal (time-series) dataset, the date column is a good place to start. We can use pd.DataFrame.dtypes to check the data type, either of a single column (using the selection brackets) or all columns in the dataset.

print(df.dtypes)
Date              object
Location          object
MinTemp          float64
MaxTemp          float64
Rainfall         float64
Evaporation      float64
Sunshine         float64
WindGustDir       object
WindGustSpeed    float64
WindDir9am        object
WindDir3pm        object
WindSpeed9am     float64
WindSpeed3pm     float64
Humidity9am      float64
Humidity3pm      float64
Pressure9am      float64
Pressure3pm      float64
Cloud9am         float64
Cloud3pm         float64
Temp9am          float64
Temp3pm          float64
RainToday         object
RainTomorrow      object
dtype: object
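
We can check a single column in the same way, using the selection brackets:

# check the data type of a single column
print(df['Date'].dtypes)
object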

All columns are stored as either object or float64. The object data type is for generic non-numeric data, but from the columns that are stored as objects, we can tell that these are mostly categorical variables where the categories are represented as text. The float64 data type refers to numeric data that includes decimals (float64 = 64-bit floating point number).

The date column is stored as an object, but pandas can store dates as datetime64. We can convert dates using pd.to_datetime(). When transforming data, if we want to keep those transformations, we have to store those changes, using =. In this case, we want to convert the date column but we don’t want to create an entirely new dataframe to handle this change, so we can overwrite the current date column by using the selection brackets to identify the column we want to apply this change to.

# convert date column to datetime
df['Date'] = pd.to_datetime(df['Date'])
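
As an aside, pandas can also parse dates at import, using the parse_dates argument of pd.read_csv():

# alternative: parse the date column when importing the data
# df = pd.read_csv('data/weatherAUS.csv', parse_dates=['Date'])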

The remaining object columns can be converted to categorical, which makes them easier to work with in subsequent analyses. We can use pd.DataFrame.astype() to convert column data types.

# create a list of all object columns
object_cols = ['Location', 'WindGustDir', 'WindDir9am', 'WindDir3pm', 'RainToday', 'RainTomorrow']

# convert object columns to category
df[object_cols] = df[object_cols].astype('category')

A more general, though syntactically more complex, way of doing this is to use lambda functions. We won’t cover lambda functions in this session (they will be discussed in detail in a future session), but below is how we can use them to convert objects to categories.

# convert object columns to category data type
df = df.apply(lambda x: x.astype('category') if x.dtype == 'object' else x)
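
Either way, we can check that the conversions have worked:

# count columns by data type
# (we expect 16 float64, 6 category, and 1 datetime64[ns])
print(df.dtypes.value_counts())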

Another choice we might make is to remove missing values, using pd.DataFrame.dropna() to drop the null values and keep only the non-null observations. We can use this to drop all null values across the entire dataset, or we can apply it to a subset of columns, using the subset argument.

# filter observations where sunshine is NA
df.dropna(subset='Sunshine')
Date Location MinTemp MaxTemp Rainfall Evaporation Sunshine WindGustDir WindGustSpeed WindDir9am ... Humidity9am Humidity3pm Pressure9am Pressure3pm Cloud9am Cloud3pm Temp9am Temp3pm RainToday RainTomorrow
6049 2009-01-01 Cobar 17.9 35.2 0.0 12.0 12.3 SSW 48.0 ENE ... 20.0 13.0 1006.3 1004.4 2.0 5.0 26.6 33.4 No No
6050 2009-01-02 Cobar 18.4 28.9 0.0 14.8 13.0 S 37.0 SSE ... 30.0 8.0 1012.9 1012.1 1.0 1.0 20.3 27.0 No No
6051 2009-01-03 Cobar 15.5 34.1 0.0 12.6 13.3 SE 30.0 NaN ... NaN 7.0 NaN 1011.6 NaN 1.0 NaN 32.7 No No
6052 2009-01-04 Cobar 19.4 37.6 0.0 10.8 10.6 NNE 46.0 NNE ... 42.0 22.0 1012.3 1009.2 1.0 6.0 28.7 34.9 No No
6053 2009-01-05 Cobar 21.9 38.4 0.0 11.4 12.2 WNW 31.0 WNW ... 37.0 22.0 1012.7 1009.1 1.0 5.0 29.1 35.6 No No
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
142298 2017-06-20 Darwin 19.3 33.4 0.0 6.0 11.0 ENE 35.0 SE ... 63.0 32.0 1013.9 1010.5 0.0 1.0 24.5 32.3 No No
142299 2017-06-21 Darwin 21.2 32.6 0.0 7.6 8.6 E 37.0 SE ... 56.0 28.0 1014.6 1011.2 7.0 0.0 24.8 32.0 No No
142300 2017-06-22 Darwin 20.7 32.8 0.0 5.6 11.0 E 33.0 E ... 46.0 23.0 1015.3 1011.8 0.0 0.0 24.8 32.1 No No
142301 2017-06-23 Darwin 19.5 31.8 0.0 6.2 10.6 ESE 26.0 SE ... 62.0 58.0 1014.9 1010.7 1.0 1.0 24.8 29.2 No No
142302 2017-06-24 Darwin 20.2 31.7 0.0 5.6 10.7 ENE 30.0 ENE ... 73.0 32.0 1013.9 1009.7 6.0 5.0 25.4 31.0 No No

75625 rows × 23 columns

We haven’t stored this transformation, because dropping nulls without careful consideration is a bad idea, but it’s useful to know nonetheless.
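
If we did want to keep the change, we would assign the result, for example to a new dataframe (df_sunshine is just an illustrative name):

# store the filtered data as a new dataframe
# df_sunshine = df.dropna(subset='Sunshine')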

There are lots of ways we could transform the data, but the final example we will consider here is reshaping the data using pd.DataFrame.pivot(), which transforms the data from long to wide format, and pd.DataFrame.melt(), which transforms it from wide to long format.

Perhaps we want to focus on the maximum temperature per day in each location in 2015. We can use pd.Series.dt.year to get the year from the date column, and filter for the year 2015, before reshaping the data.

df2015 = df.loc[df['Date'].dt.year == 2015]
df_wide = df2015.pivot(index='Date', columns='Location', values='MaxTemp')

df_wide.head()
Location Adelaide Albany Albury AliceSprings BadgerysCreek Ballarat Bendigo Brisbane Cairns Canberra ... Townsville Tuggeranong Uluru WaggaWagga Walpole Watsonia Williamtown Witchcliffe Wollongong Woomera
Date
2015-01-01 37.0 21.9 33.5 40.3 34.7 27.4 31.2 31.3 33.7 32.6 ... 32.6 32.1 42.0 35.2 23.6 28.3 33.7 25.0 25.3 39.2
2015-01-02 44.1 21.2 39.6 41.4 30.5 38.2 39.8 30.5 33.7 35.2 ... 33.0 34.1 42.4 38.9 21.1 40.6 29.3 23.6 24.6 43.3
2015-01-03 38.2 21.5 38.3 36.4 34.3 37.5 40.3 28.9 33.6 34.7 ... 28.1 33.7 39.8 37.5 21.8 39.5 32.8 23.0 25.7 44.7
2015-01-04 30.5 23.3 33.1 29.0 34.8 23.5 29.0 30.2 29.4 32.5 ... 31.6 32.8 36.1 33.8 24.4 25.1 34.5 29.8 25.3 37.6
2015-01-05 34.9 24.9 35.2 27.1 27.2 26.6 33.6 28.1 31.4 29.6 ... 31.6 28.9 38.8 34.9 29.5 25.7 27.0 31.7 23.1 38.3

5 rows × 49 columns

Perhaps we want to look at the maximum and minimum temperatures in each location, together. We can reshape the data to support this4.

df_long = df2015.melt(
    id_vars=['Date', 'Location'],
    value_vars=['MaxTemp', 'MinTemp'],
    var_name='Variable',
    value_name='Value'
)

df_long.head()
Date Location Variable Value
0 2015-01-01 Albury MaxTemp 33.5
1 2015-01-02 Albury MaxTemp 39.6
2 2015-01-03 Albury MaxTemp 38.3
3 2015-01-04 Albury MaxTemp 33.1
4 2015-01-05 Albury MaxTemp 35.2
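
This long format is particularly handy for visualisation. As a quick sketch, using the seaborn and matplotlib packages we imported earlier (Albury is an arbitrary choice of location):

# plot the max and min temperatures for a single location
sns.lineplot(data=df_long[df_long['Location'] == 'Albury'], x='Date', y='Value', hue='Variable')
plt.show()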

Exercises

Some of these questions are easily answered by scrolling up and finding the answer in the output of the above code; however, the goal is to find the answer using code. No one actually cares what the answer to any of these questions is; it’s the process that matters!

Remember, if you don’t know the answer, it’s okay to Google it (or speak to others, including me, for help)!

Import Data (to Reset)
# import the dataset
df = pd.read_csv('data/weatherAUS.csv')
  1. What is the ‘Sunshine’ column’s data type?
# What is the 'Sunshine' column's data type?
print(df['Sunshine'].dtypes)
float64
  2. Identify all the columns that are of dtype ‘object’.
# Identify all the columns that are of dtype 'object'
print(list(df.select_dtypes(include=['object'])))
['Date', 'Location', 'WindGustDir', 'WindDir9am', 'WindDir3pm', 'RainToday', 'RainTomorrow']
  3. How many of the dataframe’s columns are of dtype ‘object’?
# How many of the dataframe's columns are of dtype 'object'?
len(list(df.select_dtypes(include=['object'])))
7
  4. How many of the ‘Rainfall’ column values are NAs?
# How many of the 'Rainfall' column values are NAs?
print(df['Rainfall'].isna().sum())
3261
  5. Create a new dataframe which only includes the ‘Date’, ‘Location’, ‘Sunshine’, ‘Rainfall’, and ‘RainTomorrow’ columns.
new_df = df[['Date', 'Location', 'Sunshine', 'Rainfall', 'RainTomorrow']]
new_df.head()
Date Location Sunshine Rainfall RainTomorrow
0 2008-12-01 Albury NaN 0.6 No
1 2008-12-02 Albury NaN 0.0 No
2 2008-12-03 Albury NaN 0.0 No
3 2008-12-04 Albury NaN 0.0 No
4 2008-12-05 Albury NaN 1.0 No
  6. Convert ‘RainTomorrow’ to a numeric variable, where ‘Yes’ = 1 and ‘No’ = 0.
# df['Location'].astype('category').cat.codes
# df['RainTomorrow'].astype('category').cat.codes
df['RainTomorrow'].map({'Yes': 1, 'No': 0})
0         0.0
1         0.0
2         0.0
3         0.0
4         0.0
         ... 
145455    0.0
145456    0.0
145457    0.0
145458    0.0
145459    NaN
Name: RainTomorrow, Length: 145460, dtype: float64
  7. What is the average amount of rainfall for each location?
# average rainfall by location, sorted by value
df.groupby('Location')['Rainfall'].mean().sort_values(ascending=False)
Location
Cairns              5.742035
Darwin              5.092452
CoffsHarbour        5.061497
GoldCoast           3.769396
Wollongong          3.594903
Williamtown         3.591108
Townsville          3.485592
NorahHead           3.387299
Sydney              3.324543
MountGinini         3.292260
Katherine           3.201090
Newcastle           3.183892
Brisbane            3.144891
NorfolkIsland       3.127665
SydneyAirport       3.009917
Walpole             2.906846
Witchcliffe         2.895664
Portland            2.530374
Albany              2.263859
BadgerysCreek       2.193101
Penrith             2.175304
Tuggeranong         2.164043
Dartmoor            2.146567
Richmond            2.138462
MountGambier        2.087562
Launceston          2.011988
Albury              1.914115
Perth               1.906295
Melbourne           1.870062
Watsonia            1.860820
PerthAirport        1.761648
Canberra            1.741720
Ballarat            1.740026
WaggaWagga          1.709946
PearceRAAF          1.669080
Moree               1.630203
Bendigo             1.619380
Hobart              1.601819
Adelaide            1.566354
Sale                1.510167
MelbourneAirport    1.451977
Nuriootpa           1.390343
Cobar               1.127309
SalmonGums          1.034382
Mildura             0.945062
Nhil                0.934863
AliceSprings        0.882850
Uluru               0.784363
Woomera             0.490405
Name: Rainfall, dtype: float64
  8. What is the average amount of rainfall for days that it will rain tomorrow?
# average rainfall depending on whether it will rain tomorrow or not
df.groupby('RainTomorrow')['Rainfall'].mean()
RainTomorrow
No     1.270290
Yes    6.142104
Name: Rainfall, dtype: float64
  9. What is the average amount of sunshine in Perth when it will not rain tomorrow?
# average sunshine in Perth when it won't rain tomorrow
df.loc[(df['Location'] == 'Perth') & (df['RainTomorrow'] == 'No'), 'Sunshine'].mean()
# df[(df['Location']=='Perth') & (df['RainTomorrow']=='No')]['Sunshine'].mean()
np.float64(9.705306603773584)
  10. We want to understand the role that time plays in the dataset. Using the original dataframe, carry out the following tasks and answer the corresponding questions:
    • Create columns representing the year and month from the ‘Date’ column. How many years of data are in the dataset?
    • Examine the distribution of the ‘Sunshine’ NAs over time. Is time a component in the ‘Sunshine’ data quality issues?
    • Calculate the average rainfall and sunshine by month. How do rainfall and sunshine vary through the year?
    • Calculate the average rainfall and sunshine by year. How have rainfall and sunshine changed over time?
# get year and month columns
df = (
    df.assign(Date=pd.to_datetime(df['Date']))
    .assign(
        Year=lambda x: x['Date'].dt.year,
        Month=lambda x: x['Date'].dt.month
    )
)

# count unique years
df['Year'].nunique()
11
# lambda function counting nulls by year
df.groupby('Year')['Sunshine'].apply(lambda x: x.isna().sum())
Year
2007        0
2008      323
2009     6146
2010     6220
2011     6053
2012     6539
2013     7570
2014     9157
2015     9441
2016    11994
2017     6392
Name: Sunshine, dtype: int64
# rainfall and sunshine by month
df.groupby('Month')[['Rainfall', 'Sunshine']].mean().round(1)
Rainfall Sunshine
Month
1 2.7 9.2
2 3.2 8.6
3 2.8 7.6
4 2.3 7.1
5 2.0 6.3
6 2.8 5.6
7 2.2 6.1
8 2.0 7.1
9 1.9 7.7
10 1.6 8.5
11 2.3 8.7
12 2.5 9.0
# rainfall and sunshine by year
df.groupby('Year')[['Rainfall', 'Sunshine']].mean().round(1)
Rainfall Sunshine
Year
2007 3.2 8.1
2008 2.3 7.8
2009 2.2 7.9
2010 2.7 7.3
2011 2.8 7.3
2012 2.4 7.6
2013 2.3 7.7
2014 2.0 7.8
2015 2.2 7.7
2016 2.4 7.6
2017 2.5 7.7

Footnotes

  1. For more information about pandas data types, check out the pandas documentation on dtypes.↩︎

  2. For more information, I’d recommend the pandas documentation, and this pandas tutorial on subsetting data.↩︎

  3. Some functions need to be wrapped in print() in order to return a value that is easy to read, while others don’t. There is an internal logic to which is which, but it’s not of huge importance to us. You are better off just testing functions out and wrapping them in print() if necessary.↩︎

  4. This is often very useful when we need to visualise data. For example, plotting the max and min temp for each location is easier if the values are organised in the same column and differentiated using another column.↩︎