Exploring Data Using Pandas

This is the first of four sessions looking at how to explore data in Python. This session will focus on introducing the Python library, pandas. We will use pandas to import, inspect, summarise, and transform the data, illustrating a typical exploratory data analysis workflow.

We are using Australian weather data, taken from Kaggle. This dataset is used to build machine learning models that predict whether it will rain tomorrow, using daily weather observations from 2007 to 2017. The data can be downloaded from Kaggle.

# install necessary packages
# !uv add skimpy

# import packages
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

from skimpy import skim
# import the dataset
df = pd.read_csv('data/weatherAUS.csv')

Setting the Scene

Before we start to explore any dataset, we need to establish what we are looking to do with the data. This should inform our decisions throughout the exploration, and any analysis that follows.

Questions:

  • What are we trying to achieve?
  • How do our goals impact our analysis?
  • What should we take into consideration before we write any code?
  • What sort of questions might we be interested in with this dataset?

What Our Data Can Tell Us (And What it Can’t)

We also need to consider what the data is and where it came from.

Questions:

  • How was the data collected?
  • What is it missing?
  • What do the variables in our dataset actually mean, and are they a good approximation of the concepts we are interested in?

Exploring the Dataset

First, we should start with dataset-wide operations.

Questions:

  • What do we want to know about a dataset when we first encounter it?
  • How do we get a quick overview of the data that can help us in our next steps?
  • We need to get a “feel” for the data before we can really make any decisions about how to analyse it. How do we get there with a new dataset?

We can start by getting a quick glance at the data. The usual starting point when you have just imported a new dataset is the pandas method pd.DataFrame.head(), which shows the top \(n\) rows of the dataset (by default, the top five rows).

# view the top five rows
df.head()
Date Location MinTemp MaxTemp Rainfall Evaporation Sunshine WindGustDir WindGustSpeed WindDir9am ... Humidity9am Humidity3pm Pressure9am Pressure3pm Cloud9am Cloud3pm Temp9am Temp3pm RainToday RainTomorrow
0 2008-12-01 Albury 13.4 22.9 0.6 NaN NaN W 44.0 W ... 71.0 22.0 1007.7 1007.1 8.0 NaN 16.9 21.8 No No
1 2008-12-02 Albury 7.4 25.1 0.0 NaN NaN WNW 44.0 NNW ... 44.0 25.0 1010.6 1007.8 NaN NaN 17.2 24.3 No No
2 2008-12-03 Albury 12.9 25.7 0.0 NaN NaN WSW 46.0 W ... 38.0 30.0 1007.6 1008.7 NaN 2.0 21.0 23.2 No No
3 2008-12-04 Albury 9.2 28.0 0.0 NaN NaN NE 24.0 SE ... 45.0 16.0 1017.6 1012.8 NaN NaN 18.1 26.5 No No
4 2008-12-05 Albury 17.5 32.3 1.0 NaN NaN W 41.0 ENE ... 82.0 33.0 1010.8 1006.0 7.0 8.0 17.8 29.7 No No

5 rows × 23 columns

You can also look at the bottom rows of the dataset, using pd.DataFrame.tail(). This might be useful if you are dealing with time-series data. Below, we specify that we want to look at the bottom ten rows.

# view the bottom ten rows
df.tail(10)
Date Location MinTemp MaxTemp Rainfall Evaporation Sunshine WindGustDir WindGustSpeed WindDir9am ... Humidity9am Humidity3pm Pressure9am Pressure3pm Cloud9am Cloud3pm Temp9am Temp3pm RainToday RainTomorrow
145450 2017-06-16 Uluru 5.2 24.3 0.0 NaN NaN E 24.0 SE ... 53.0 24.0 1023.8 1020.0 NaN NaN 12.3 23.3 No No
145451 2017-06-17 Uluru 6.4 23.4 0.0 NaN NaN ESE 31.0 S ... 53.0 25.0 1025.8 1023.0 NaN NaN 11.2 23.1 No No
145452 2017-06-18 Uluru 8.0 20.7 0.0 NaN NaN ESE 41.0 SE ... 56.0 32.0 1028.1 1024.3 NaN 7.0 11.6 20.0 No No
145453 2017-06-19 Uluru 7.4 20.6 0.0 NaN NaN E 35.0 ESE ... 63.0 33.0 1027.2 1023.3 NaN NaN 11.0 20.3 No No
145454 2017-06-20 Uluru 3.5 21.8 0.0 NaN NaN E 31.0 ESE ... 59.0 27.0 1024.7 1021.2 NaN NaN 9.4 20.9 No No
145455 2017-06-21 Uluru 2.8 23.4 0.0 NaN NaN E 31.0 SE ... 51.0 24.0 1024.6 1020.3 NaN NaN 10.1 22.4 No No
145456 2017-06-22 Uluru 3.6 25.3 0.0 NaN NaN NNW 22.0 SE ... 56.0 21.0 1023.5 1019.1 NaN NaN 10.9 24.5 No No
145457 2017-06-23 Uluru 5.4 26.9 0.0 NaN NaN N 37.0 SE ... 53.0 24.0 1021.0 1016.8 NaN NaN 12.5 26.1 No No
145458 2017-06-24 Uluru 7.8 27.0 0.0 NaN NaN SE 28.0 SSE ... 51.0 24.0 1019.4 1016.5 3.0 2.0 15.1 26.0 No No
145459 2017-06-25 Uluru 14.9 NaN 0.0 NaN NaN NaN NaN ESE ... 62.0 36.0 1020.2 1017.9 8.0 8.0 15.0 20.9 No NaN

10 rows × 23 columns

A quick glimpse at the data is useful, but we may also want quick descriptions of several aspects of the data, such as the length of the dataset (len(), which can also be used to get the length of various other Python objects), which tells us how many observations we have.

# get the object length
len(df)
145460

Another option is pd.DataFrame.shape, which shows the length (number of rows) and width (number of columns).

# get the object shape (number of rows, number of columns)
df.shape
(145460, 23)

Speaking of columns, if we want a quick list of the column names, we can get this using pd.DataFrame.columns.

# get all column names
df.columns
Index(['Date', 'Location', 'MinTemp', 'MaxTemp', 'Rainfall', 'Evaporation',
       'Sunshine', 'WindGustDir', 'WindGustSpeed', 'WindDir9am', 'WindDir3pm',
       'WindSpeed9am', 'WindSpeed3pm', 'Humidity9am', 'Humidity3pm',
       'Pressure9am', 'Pressure3pm', 'Cloud9am', 'Cloud3pm', 'Temp9am',
       'Temp3pm', 'RainToday', 'RainTomorrow'],
      dtype='object')

A quick and easy way to get some valuable information about the dataset is pd.DataFrame.info(), which includes the total non-null observations and the data type1 of each column.

# get dataframe info (column indices, non-null counts, data types)
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 145460 entries, 0 to 145459
Data columns (total 23 columns):
 #   Column         Non-Null Count   Dtype  
---  ------         --------------   -----  
 0   Date           145460 non-null  object 
 1   Location       145460 non-null  object 
 2   MinTemp        143975 non-null  float64
 3   MaxTemp        144199 non-null  float64
 4   Rainfall       142199 non-null  float64
 5   Evaporation    82670 non-null   float64
 6   Sunshine       75625 non-null   float64
 7   WindGustDir    135134 non-null  object 
 8   WindGustSpeed  135197 non-null  float64
 9   WindDir9am     134894 non-null  object 
 10  WindDir3pm     141232 non-null  object 
 11  WindSpeed9am   143693 non-null  float64
 12  WindSpeed3pm   142398 non-null  float64
 13  Humidity9am    142806 non-null  float64
 14  Humidity3pm    140953 non-null  float64
 15  Pressure9am    130395 non-null  float64
 16  Pressure3pm    130432 non-null  float64
 17  Cloud9am       89572 non-null   float64
 18  Cloud3pm       86102 non-null   float64
 19  Temp9am        143693 non-null  float64
 20  Temp3pm        141851 non-null  float64
 21  RainToday      142199 non-null  object 
 22  RainTomorrow   142193 non-null  object 
dtypes: float64(16), object(7)
memory usage: 25.5+ MB

If we wanted to get a better sense of the null values in each column, we could calculate the percentage of null values by capturing whether each value in each column is null (pd.DataFrame.isnull()), summing the total null values in each column (pd.DataFrame.sum()), and then dividing by the length of the dataframe (len()).

# calculate the percentage of null values in each column
df.isnull().sum()/len(df)
Date             0.000000
Location         0.000000
MinTemp          0.010209
MaxTemp          0.008669
Rainfall         0.022419
Evaporation      0.431665
Sunshine         0.480098
WindGustDir      0.070989
WindGustSpeed    0.070555
WindDir9am       0.072639
WindDir3pm       0.029066
WindSpeed9am     0.012148
WindSpeed3pm     0.021050
Humidity9am      0.018246
Humidity3pm      0.030984
Pressure9am      0.103568
Pressure3pm      0.103314
Cloud9am         0.384216
Cloud3pm         0.408071
Temp9am          0.012148
Temp3pm          0.024811
RainToday        0.022419
RainTomorrow     0.022460
dtype: float64

If we want a quick summary of all the numeric columns in the dataset, we can use pd.DataFrame.describe().

# quick summary of numeric variables
df.describe()
MinTemp MaxTemp Rainfall Evaporation Sunshine WindGustSpeed WindSpeed9am WindSpeed3pm Humidity9am Humidity3pm Pressure9am Pressure3pm Cloud9am Cloud3pm Temp9am Temp3pm
count 143975.000000 144199.000000 142199.000000 82670.000000 75625.000000 135197.000000 143693.000000 142398.000000 142806.000000 140953.000000 130395.00000 130432.000000 89572.000000 86102.000000 143693.000000 141851.00000
mean 12.194034 23.221348 2.360918 5.468232 7.611178 40.035230 14.043426 18.662657 68.880831 51.539116 1017.64994 1015.255889 4.447461 4.509930 16.990631 21.68339
std 6.398495 7.119049 8.478060 4.193704 3.785483 13.607062 8.915375 8.809800 19.029164 20.795902 7.10653 7.037414 2.887159 2.720357 6.488753 6.93665
min -8.500000 -4.800000 0.000000 0.000000 0.000000 6.000000 0.000000 0.000000 0.000000 0.000000 980.50000 977.100000 0.000000 0.000000 -7.200000 -5.40000
25% 7.600000 17.900000 0.000000 2.600000 4.800000 31.000000 7.000000 13.000000 57.000000 37.000000 1012.90000 1010.400000 1.000000 2.000000 12.300000 16.60000
50% 12.000000 22.600000 0.000000 4.800000 8.400000 39.000000 13.000000 19.000000 70.000000 52.000000 1017.60000 1015.200000 5.000000 5.000000 16.700000 21.10000
75% 16.900000 28.200000 0.800000 7.400000 10.600000 48.000000 19.000000 24.000000 83.000000 66.000000 1022.40000 1020.000000 7.000000 7.000000 21.600000 26.40000
max 33.900000 48.100000 371.000000 145.000000 14.500000 135.000000 130.000000 87.000000 100.000000 100.000000 1041.00000 1039.600000 9.000000 9.000000 40.200000 46.70000

However, I prefer to bring in another package, skimpy, that does all of this very quickly and cleanly. We can get a detailed description of the entire dataset using skim().

# a more informative summary function from the skimpy package
skim(df)
skimpy summary

Data Summary: 145460 rows × 23 columns
Data Types: float64 (16), string (7)

number
column         NA     NA %   mean   sd     p0     p25   p50   p75   p100
MinTemp        1485   1.02   12.19  6.398  -8.5   7.6   12    16.9  33.9
MaxTemp        1261   0.87   23.22  7.119  -4.8   17.9  22.6  28.2  48.1
Rainfall       3261   2.24   2.361  8.478  0      0     0     0.8   371
Evaporation    62790  43.17  5.468  4.194  0      2.6   4.8   7.4   145
Sunshine       69835  48.01  7.611  3.785  0      4.8   8.4   10.6  14.5
WindGustSpeed  10263  7.06   40.04  13.61  6      31    39    48    135
WindSpeed9am   1767   1.21   14.04  8.915  0      7     13    19    130
WindSpeed3pm   3062   2.11   18.66  8.81   0      13    19    24    87
Humidity9am    2654   1.82   68.88  19.03  0      57    70    83    100
Humidity3pm    4507   3.10   51.54  20.8   0      37    52    66    100
Pressure9am    15065  10.36  1018   7.107  980.5  1013  1018  1022  1041
Pressure3pm    15028  10.33  1015   7.037  977.1  1010  1015  1020  1040
Cloud9am       55888  38.42  4.447  2.887  0      1     5     7     9
Cloud3pm       59358  40.81  4.51   2.72   0      2     5     7     9
Temp9am        1767   1.21   16.99  6.489  -7.2   12.3  16.7  21.6  40.2
Temp3pm        3609   2.48   21.68  6.937  -5.4   16.6  21.1  26.4  46.7

string
column        NA     NA %   shortest    longest           min         max         chars per row  words per row  total words
Date          0      0.00   2008-12-01  2008-12-01        2007-11-01  2017-06-25  10             1              145460
Location      0      0.00   Sale        MelbourneAirport  Adelaide    Woomera     8.71           1              145460
WindGustDir   10326  7.10   W           WNW               E           WSW         2.19           0.93           135134
WindDir9am    10566  7.26   W           NNW               E           WSW         2.18           0.93           134894
WindDir3pm    4228   2.91   E           WNW               E           WSW         2.21           0.97           141232
RainToday     3261   2.24   No          Yes               No          Yes         2.22           0.98           142199
RainTomorrow  3267   2.25   No          Yes               No          Yes         2.22           0.98           142193

Exploring Variables (Columns) & Observations (Rows)

If we are going to narrow our focus to specific variables or groups of observations, we need to know how to select columns, filter values, and group the data. There are lots of different ways we can slice up the data. We won’t cover all of them here2, but we will try to cover a range that helps illustrate how pandas works and will help you build the intuition for working with data in pandas.

We can select columns in a variety of ways, but the “correct” way to select columns in most circumstances is using selection brackets (the square brackets []), also known as the indexing operator.

# selecting a single column by name
df['Date']

# alternative ways to select columns
# df.loc[:, 'Date']
# df.Date
0         2008-12-01
1         2008-12-02
2         2008-12-03
3         2008-12-04
4         2008-12-05
             ...    
145455    2017-06-21
145456    2017-06-22
145457    2017-06-23
145458    2017-06-24
145459    2017-06-25
Name: Date, Length: 145460, dtype: object

If we want to select multiple columns, we can use double square brackets ([[ ]]). This is the same process as before, but the inner brackets define a list, and the outer brackets are the selection brackets.

# selecting multiple columns (and all rows) by name
df[['Date', 'Location', 'Rainfall']]
# df.loc[:, ['Date', 'Location', 'Rainfall']]
Date Location Rainfall
0 2008-12-01 Albury 0.6
1 2008-12-02 Albury 0.0
2 2008-12-03 Albury 0.0
3 2008-12-04 Albury 0.0
4 2008-12-05 Albury 1.0
... ... ... ...
145455 2017-06-21 Uluru 0.0
145456 2017-06-22 Uluru 0.0
145457 2017-06-23 Uluru 0.0
145458 2017-06-24 Uluru 0.0
145459 2017-06-25 Uluru 0.0

145460 rows × 3 columns

While selection brackets are a quick and easy solution if we want to grab a subset of variables in the dataset, they are really only intended for simple operations involving column selection.

For row selection, we should use pd.DataFrame.iloc[]. The iloc indexer is used for “integer position” selection, which means you can select rows or columns using their position in the dataframe. To select rows 10-15, we can use the following (slices exclude the end position, so 10:16 returns rows 10 through 15):

# slicing by rows
df.iloc[10:16]
Date Location MinTemp MaxTemp Rainfall Evaporation Sunshine WindGustDir WindGustSpeed WindDir9am ... Humidity9am Humidity3pm Pressure9am Pressure3pm Cloud9am Cloud3pm Temp9am Temp3pm RainToday RainTomorrow
10 2008-12-11 Albury 13.4 30.4 0.0 NaN NaN N 30.0 SSE ... 48.0 22.0 1011.8 1008.7 NaN NaN 20.4 28.8 No Yes
11 2008-12-12 Albury 15.9 21.7 2.2 NaN NaN NNE 31.0 NE ... 89.0 91.0 1010.5 1004.2 8.0 8.0 15.9 17.0 Yes Yes
12 2008-12-13 Albury 15.9 18.6 15.6 NaN NaN W 61.0 NNW ... 76.0 93.0 994.3 993.0 8.0 8.0 17.4 15.8 Yes Yes
13 2008-12-14 Albury 12.6 21.0 3.6 NaN NaN SW 44.0 W ... 65.0 43.0 1001.2 1001.8 NaN 7.0 15.8 19.8 Yes No
14 2008-12-15 Albury 8.4 24.6 0.0 NaN NaN NaN NaN S ... 57.0 32.0 1009.7 1008.7 NaN NaN 15.9 23.5 No NaN
15 2008-12-16 Albury 9.8 27.7 NaN NaN NaN WNW 50.0 NaN ... 50.0 28.0 1013.4 1010.3 0.0 NaN 17.3 26.2 NaN No

6 rows × 23 columns

We can do something similar using a column’s integer position, but we have to select all rows (:) first:

# using iloc with columns
df.iloc[:, 20]
0         21.8
1         24.3
2         23.2
3         26.5
4         29.7
          ... 
145455    22.4
145456    24.5
145457    26.1
145458    26.0
145459    20.9
Name: Temp3pm, Length: 145460, dtype: float64

Finally, we can put both together to take a subset of both rows and columns:

# using iloc with rows and columns
df.iloc[10:16, 20]
10    28.8
11    17.0
12    15.8
13    19.8
14    23.5
15    26.2
Name: Temp3pm, dtype: float64

However, selecting by integer position is relatively limited. It is more likely that we will want to subset the data based on the values of certain columns. We can filter rows by condition using pd.DataFrame.loc[]. The loc indexer slices by label, instead of integer position.

For example, we might want to look at a subset of the data based on location.

# select all observations in Perth
df.loc[df['Location'] == 'Perth']
Date Location MinTemp MaxTemp Rainfall Evaporation Sunshine WindGustDir WindGustSpeed WindDir9am ... Humidity9am Humidity3pm Pressure9am Pressure3pm Cloud9am Cloud3pm Temp9am Temp3pm RainToday RainTomorrow
120638 2008-07-01 Perth 2.7 18.8 0.0 0.8 9.1 ENE 20.0 NaN ... 97.0 53.0 1027.6 1024.5 2.0 3.0 8.5 18.1 No No
120639 2008-07-02 Perth 6.4 20.7 0.0 1.8 7.0 NE 22.0 ESE ... 80.0 39.0 1024.1 1019.0 0.0 6.0 11.1 19.7 No No
120640 2008-07-03 Perth 6.5 19.9 0.4 2.2 7.3 NE 31.0 NaN ... 84.0 71.0 1016.8 1015.6 1.0 3.0 12.1 17.7 No Yes
120641 2008-07-04 Perth 9.5 19.2 1.8 1.2 4.7 W 26.0 NNE ... 93.0 73.0 1019.3 1018.4 6.0 6.0 13.2 17.7 Yes Yes
120642 2008-07-05 Perth 9.5 16.4 1.8 1.4 4.9 WSW 44.0 W ... 69.0 57.0 1020.4 1022.1 7.0 5.0 15.9 16.0 Yes Yes
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
123826 2017-06-21 Perth 10.3 19.9 0.2 1.8 7.5 NW 37.0 NNE ... 89.0 60.0 1017.1 1013.8 5.0 6.0 13.0 18.5 No Yes
123827 2017-06-22 Perth 13.0 16.8 61.2 3.6 0.0 SSW 46.0 W ... 90.0 75.0 1005.6 1008.9 7.0 7.0 16.4 15.6 Yes No
123828 2017-06-23 Perth 13.3 18.9 0.4 1.8 6.5 SE 37.0 SE ... 85.0 65.0 1019.2 1019.4 6.0 6.0 15.1 18.0 No No
123829 2017-06-24 Perth 11.5 18.2 0.0 3.8 9.3 SE 30.0 ESE ... 62.0 47.0 1025.9 1023.4 1.0 3.0 14.0 17.6 No No
123830 2017-06-25 Perth 6.3 17.0 0.0 1.6 7.9 E 26.0 SE ... 75.0 49.0 1028.6 1026.0 1.0 3.0 11.5 15.6 No No

3193 rows × 23 columns

We can also filter by multiple values, such as location and rainfall.

df.loc[(df['Rainfall'] == 0) & (df['Location'] == 'Perth')]
Date Location MinTemp MaxTemp Rainfall Evaporation Sunshine WindGustDir WindGustSpeed WindDir9am ... Humidity9am Humidity3pm Pressure9am Pressure3pm Cloud9am Cloud3pm Temp9am Temp3pm RainToday RainTomorrow
120638 2008-07-01 Perth 2.7 18.8 0.0 0.8 9.1 ENE 20.0 NaN ... 97.0 53.0 1027.6 1024.5 2.0 3.0 8.5 18.1 No No
120639 2008-07-02 Perth 6.4 20.7 0.0 1.8 7.0 NE 22.0 ESE ... 80.0 39.0 1024.1 1019.0 0.0 6.0 11.1 19.7 No No
120644 2008-07-07 Perth 0.7 18.3 0.0 0.8 9.3 N 37.0 NE ... 72.0 36.0 1028.9 1024.2 1.0 5.0 8.7 17.9 No No
120645 2008-07-08 Perth 3.2 20.4 0.0 1.4 6.9 NNW 24.0 NE ... 58.0 42.0 1023.9 1021.1 6.0 5.0 10.2 19.3 No Yes
120651 2008-07-14 Perth 7.9 19.7 0.0 0.2 6.5 NE 31.0 NE ... 86.0 41.0 1026.0 1021.9 6.0 5.0 11.7 18.7 No No
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
123823 2017-06-18 Perth 7.5 23.4 0.0 1.8 9.2 NNE 28.0 ENE ... 67.0 41.0 1026.9 1022.9 0.0 0.0 14.2 22.2 No No
123824 2017-06-19 Perth 5.5 23.0 0.0 3.0 9.1 SW 19.0 ENE ... 84.0 55.0 1023.0 1020.3 1.0 2.0 11.5 22.0 No No
123825 2017-06-20 Perth 7.8 22.5 0.0 2.8 9.1 NW 26.0 W ... 98.0 59.0 1019.3 1015.9 1.0 1.0 13.5 21.6 No No
123829 2017-06-24 Perth 11.5 18.2 0.0 3.8 9.3 SE 30.0 ESE ... 62.0 47.0 1025.9 1023.4 1.0 3.0 14.0 17.6 No No
123830 2017-06-25 Perth 6.3 17.0 0.0 1.6 7.9 E 26.0 SE ... 75.0 49.0 1028.6 1026.0 1.0 3.0 11.5 15.6 No No

2293 rows × 23 columns

For any complex subsetting of the data, including multiple conditions, pd.DataFrame.loc[] is the best bet.
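
For instance, we can combine conditions with isin() (a standard pandas method) to match several locations at once:

# select dry days in either Perth or Sydney
df.loc[(df['Rainfall'] == 0) & (df['Location'].isin(['Perth', 'Sydney']))]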

Summarising Data

Now that we know how to select the variables or observations we are interested in, we can start doing some descriptive analysis. The operations we use will depend on the questions we are trying to answer, and the possibilities will be almost endless.

Questions:

  • What “functions” might we need to carry out on our data when we are exploring it?

We know that the weather data includes observations from all over the country, but we might want to check exactly how many different locations there are. We can use pd.DataFrame.nunique() to do this.

# count unique values
df['Location'].nunique()
49

We may also be interested in the locations themselves, which may tell us more about the spatial distribution of our data. In this case, we can use pd.Series.unique().

# get unique values
df['Location'].unique()
array(['Albury', 'BadgerysCreek', 'Cobar', 'CoffsHarbour', 'Moree',
       'Newcastle', 'NorahHead', 'NorfolkIsland', 'Penrith', 'Richmond',
       'Sydney', 'SydneyAirport', 'WaggaWagga', 'Williamtown',
       'Wollongong', 'Canberra', 'Tuggeranong', 'MountGinini', 'Ballarat',
       'Bendigo', 'Sale', 'MelbourneAirport', 'Melbourne', 'Mildura',
       'Nhil', 'Portland', 'Watsonia', 'Dartmoor', 'Brisbane', 'Cairns',
       'GoldCoast', 'Townsville', 'Adelaide', 'MountGambier', 'Nuriootpa',
       'Woomera', 'Albany', 'Witchcliffe', 'PearceRAAF', 'PerthAirport',
       'Perth', 'SalmonGums', 'Walpole', 'Hobart', 'Launceston',
       'AliceSprings', 'Darwin', 'Katherine', 'Uluru'], dtype=object)

Another common operation we might look to do is calculating the mean value (pd.DataFrame.mean()) of a certain variable. What is the average value of sunshine across the entire dataset?

# calculate variable mean
df['Sunshine'].mean()
np.float64(7.611177520661157)

This gives us the mean to many decimal places, and we probably don’t need to know the average sunshine hours to this level of precision. We can chain the round() method to round to two decimal places.

# round mean value
df['Sunshine'].mean().round(2)
np.float64(7.61)

Many operations will return the value with information about the object’s type included. The above values are wrapped in np.float64() because pd.DataFrame.mean() uses numpy to calculate the mean value. However, if you want to strip this information out so you only see the value itself, you can use print().
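
For example, using the same calculation as above:

# print the value without the type wrapper
print(df['Sunshine'].mean().round(2))
7.61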

While we are often interested in the mean value when we talk about averages, we might want to know the median instead (pd.DataFrame.median()).

# calculate other summary statistics
print(df['Sunshine'].median())
8.4

Another common calculation is summing values (pd.DataFrame.sum()). We can use sum() to see the total hours of sunshine in our dataset, and we can use int() to convert this value to an integer (which also means we don’t need to use print()3).

# calculate sum value and return an integer
int(df['Sunshine'].sum())
575595

We can also apply these summary operations to multiple variables, using the same selection logic as before (double square brackets).

print(df[['Sunshine', 'Rainfall']].mean())
Sunshine    7.611178
Rainfall    2.360918
dtype: float64

And we can apply multiple functions, using pd.DataFrame.agg().

df['Sunshine'].agg(['mean', 'median', 'sum']).round(1)
mean           7.6
median         8.4
sum       575595.3
Name: Sunshine, dtype: float64

The next step when exploring specific variables will often be group-level summaries. The average amount of sunshine across the whole dataset has limited utility, but the average hours of sunshine in each location allows us to compare between locations and start to understand how different variables are related to each other. If we want to do a group-level operation, we have to use pd.DataFrame.groupby().

# calculate group means
df.groupby(by='Location')['Sunshine'].mean().round(1)
Location
Adelaide            7.7
Albany              6.7
Albury              NaN
AliceSprings        9.6
BadgerysCreek       NaN
Ballarat            NaN
Bendigo             NaN
Brisbane            8.1
Cairns              7.6
Canberra            7.4
Cobar               8.7
CoffsHarbour        7.4
Dartmoor            6.5
Darwin              8.5
GoldCoast           NaN
Hobart              6.6
Katherine           NaN
Launceston          NaN
Melbourne           6.4
MelbourneAirport    6.4
Mildura             8.5
Moree               8.9
MountGambier        6.5
MountGinini         NaN
Newcastle           NaN
Nhil                NaN
NorahHead           NaN
NorfolkIsland       7.0
Nuriootpa           7.7
PearceRAAF          8.8
Penrith             NaN
Perth               8.8
PerthAirport        8.8
Portland            6.5
Richmond            NaN
Sale                6.7
SalmonGums          NaN
Sydney              7.2
SydneyAirport       7.2
Townsville          8.5
Tuggeranong         NaN
Uluru               NaN
WaggaWagga          8.2
Walpole             NaN
Watsonia            6.4
Williamtown         7.2
Witchcliffe         NaN
Wollongong          NaN
Woomera             9.0
Name: Sunshine, dtype: float64

The groupby(by='Location') call specifies the grouping variable (location), then we select the variable we want to summarise by location (sunshine), and then we specify the operation (mean).
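
If we want to make comparisons between locations easier, we can also sort the results, as a small sketch using the standard sort_values() method:

# sort group means from sunniest to least sunny
df.groupby(by='Location')['Sunshine'].mean().round(1).sort_values(ascending=False)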

There are multiple locations that return NaN (Not a Number). This indicates that numpy was unable to calculate a mean value for those locations. This is likely to be because all sunshine values for those locations are null.

We can check this using pd.DataFrame.count(), which counts the total non-null values (whereas the pd.DataFrame.size attribute counts the total values, including nulls).

# group by location and count non-null sunshine values
df.groupby('Location')['Sunshine'].count()
Location
Adelaide            1769
Albany              2520
Albury                 0
AliceSprings        2520
BadgerysCreek          0
Ballarat               0
Bendigo                0
Brisbane            3144
Cairns              2564
Canberra            1521
Cobar                550
CoffsHarbour        1494
Dartmoor            2566
Darwin              3189
GoldCoast              0
Hobart              3179
Katherine              0
Launceston             0
Melbourne           3192
MelbourneAirport    3008
Mildura             2876
Moree               2055
MountGambier        2597
MountGinini            0
Newcastle              0
Nhil                   0
NorahHead              0
NorfolkIsland       2570
Nuriootpa           2848
PearceRAAF          3004
Penrith                0
Perth               3188
PerthAirport        3004
Portland            2566
Richmond               0
Sale                1818
SalmonGums             0
Sydney              3328
SydneyAirport       2993
Townsville          2617
Tuggeranong            0
Uluru                  0
WaggaWagga          2575
Walpole                0
Watsonia            3008
Williamtown         1355
Witchcliffe            0
Wollongong             0
Woomera             2007
Name: Sunshine, dtype: int64

The results show that all the locations that return NaN in our group mean calculation have zero non-null values.
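
We can pull out just those locations programmatically, as a small sketch built on the counts above:

# list the locations with zero non-null sunshine values
sunshine_counts = df.groupby('Location')['Sunshine'].count()
print(sunshine_counts[sunshine_counts == 0].index.tolist())
['Albury', 'BadgerysCreek', 'Ballarat', 'Bendigo', 'GoldCoast', 'Katherine', 'Launceston', 'MountGinini', 'Newcastle', 'Nhil', 'NorahHead', 'Penrith', 'Richmond', 'SalmonGums', 'Tuggeranong', 'Uluru', 'Walpole', 'Witchcliffe', 'Wollongong']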

Transforming Data

Datasets are rarely perfectly clean and tidy. We often need to transform the data before we can get the most out of it.

Questions:

  • What sort of transformations would help us get the most out of the analysis of the Australian weather data?

The first step with any analysis is often converting columns to the correct types. With a longitudinal (time-series) dataset, the date column is a good place to start. We can use pd.DataFrame.dtypes to check the data type, either of a single column (using the selection brackets) or all columns in the dataset.

print(df.dtypes)
Date              object
Location          object
MinTemp          float64
MaxTemp          float64
Rainfall         float64
Evaporation      float64
Sunshine         float64
WindGustDir       object
WindGustSpeed    float64
WindDir9am        object
WindDir3pm        object
WindSpeed9am     float64
WindSpeed3pm     float64
Humidity9am      float64
Humidity3pm      float64
Pressure9am      float64
Pressure3pm      float64
Cloud9am         float64
Cloud3pm         float64
Temp9am          float64
Temp3pm          float64
RainToday         object
RainTomorrow      object
dtype: object
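
We can check a single column in the same way, using the selection brackets:

# check the data type of a single column
print(df['Date'].dtypes)
object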

All columns are stored as either object or float64. The object data type is for generic non-numeric data, but from the columns that are stored as objects, we can tell that these are mostly categorical variables where the categories are represented as text. The float64 data type refers to numeric data that includes decimals (float64 = 64-bit floating point number).

The date column is stored as an object, but pandas can store dates as datetime64. We can convert dates using pd.to_datetime(). When transforming data, if we want to keep those transformations, we have to store those changes, using =. In this case, we want to convert the date column but we don’t want to create an entirely new dataframe to handle this change, so we can overwrite the current date column by using the selection brackets to identify the column we want to apply this change to.

# convert date column to datetime
df['Date'] = pd.to_datetime(df['Date'])
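
As an aside, pandas can also parse dates at import, using the parse_dates argument of pd.read_csv():

# alternative: parse the date column when importing the data
# df = pd.read_csv('data/weatherAUS.csv', parse_dates=['Date'])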

The remaining object columns can be converted to categorical, which makes them easier to work with in subsequent analyses. We can use pd.DataFrame.astype() to convert column data types.

# create a list of all object columns
object_cols = ['Location', 'WindGustDir', 'WindDir9am', 'WindDir3pm', 'RainToday', 'RainTomorrow']

# convert object columns to category
df[object_cols] = df[object_cols].astype('category')

A more general, though syntactically more complex, way of doing this is to use lambda functions. We won’t cover lambda functions in this session (they will be discussed in detail in a future session), but below is how we can use them to convert objects to categories.

# convert object columns to category data type
df = df.apply(lambda x: x.astype('category') if x.dtype == 'object' else x)
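
Either way, we can check that the conversions have worked:

# count columns by data type
# (we expect 16 float64, 6 category, and 1 datetime64[ns])
print(df.dtypes.value_counts())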

Another choice we might make is to remove missing values, using pd.DataFrame.dropna() to drop the null values and keep only the non-null observations. We can use this to drop all null values across the entire dataset, or we can apply it to a subset of columns, using the subset argument.

# filter observations where sunshine is NA
df.dropna(subset='Sunshine')
Date Location MinTemp MaxTemp Rainfall Evaporation Sunshine WindGustDir WindGustSpeed WindDir9am ... Humidity9am Humidity3pm Pressure9am Pressure3pm Cloud9am Cloud3pm Temp9am Temp3pm RainToday RainTomorrow
6049 2009-01-01 Cobar 17.9 35.2 0.0 12.0 12.3 SSW 48.0 ENE ... 20.0 13.0 1006.3 1004.4 2.0 5.0 26.6 33.4 No No
6050 2009-01-02 Cobar 18.4 28.9 0.0 14.8 13.0 S 37.0 SSE ... 30.0 8.0 1012.9 1012.1 1.0 1.0 20.3 27.0 No No
6051 2009-01-03 Cobar 15.5 34.1 0.0 12.6 13.3 SE 30.0 NaN ... NaN 7.0 NaN 1011.6 NaN 1.0 NaN 32.7 No No
6052 2009-01-04 Cobar 19.4 37.6 0.0 10.8 10.6 NNE 46.0 NNE ... 42.0 22.0 1012.3 1009.2 1.0 6.0 28.7 34.9 No No
6053 2009-01-05 Cobar 21.9 38.4 0.0 11.4 12.2 WNW 31.0 WNW ... 37.0 22.0 1012.7 1009.1 1.0 5.0 29.1 35.6 No No
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
142298 2017-06-20 Darwin 19.3 33.4 0.0 6.0 11.0 ENE 35.0 SE ... 63.0 32.0 1013.9 1010.5 0.0 1.0 24.5 32.3 No No
142299 2017-06-21 Darwin 21.2 32.6 0.0 7.6 8.6 E 37.0 SE ... 56.0 28.0 1014.6 1011.2 7.0 0.0 24.8 32.0 No No
142300 2017-06-22 Darwin 20.7 32.8 0.0 5.6 11.0 E 33.0 E ... 46.0 23.0 1015.3 1011.8 0.0 0.0 24.8 32.1 No No
142301 2017-06-23 Darwin 19.5 31.8 0.0 6.2 10.6 ESE 26.0 SE ... 62.0 58.0 1014.9 1010.7 1.0 1.0 24.8 29.2 No No
142302 2017-06-24 Darwin 20.2 31.7 0.0 5.6 10.7 ENE 30.0 ENE ... 73.0 32.0 1013.9 1009.7 6.0 5.0 25.4 31.0 No No

75625 rows × 23 columns

We haven’t stored this transformation, because dropping nulls without careful consideration is a bad idea, but it’s useful to know nonetheless.
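
If we did want to keep the change, we would assign the result, for example to a new dataframe (df_sunshine is just an illustrative name):

# store the filtered data as a new dataframe
# df_sunshine = df.dropna(subset='Sunshine')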

There are lots of ways we could transform the data, but the final example we will consider here is reshaping the data using pd.DataFrame.pivot(), which transforms the data from long to wide format, and pd.DataFrame.melt(), which transforms it from wide to long format.

Perhaps we want to focus on the maximum temperature per day in each location in 2015. We can use pd.Series.dt.year to get the year from the date column, and filter for the year 2015, before reshaping the data.

df2015 = df.loc[df['Date'].dt.year == 2015]
df_wide = df2015.pivot(index='Date', columns='Location', values='MaxTemp')

df_wide.head()
Location Adelaide Albany Albury AliceSprings BadgerysCreek Ballarat Bendigo Brisbane Cairns Canberra ... Townsville Tuggeranong Uluru WaggaWagga Walpole Watsonia Williamtown Witchcliffe Wollongong Woomera
Date
2015-01-01 37.0 21.9 33.5 40.3 34.7 27.4 31.2 31.3 33.7 32.6 ... 32.6 32.1 42.0 35.2 23.6 28.3 33.7 25.0 25.3 39.2
2015-01-02 44.1 21.2 39.6 41.4 30.5 38.2 39.8 30.5 33.7 35.2 ... 33.0 34.1 42.4 38.9 21.1 40.6 29.3 23.6 24.6 43.3
2015-01-03 38.2 21.5 38.3 36.4 34.3 37.5 40.3 28.9 33.6 34.7 ... 28.1 33.7 39.8 37.5 21.8 39.5 32.8 23.0 25.7 44.7
2015-01-04 30.5 23.3 33.1 29.0 34.8 23.5 29.0 30.2 29.4 32.5 ... 31.6 32.8 36.1 33.8 24.4 25.1 34.5 29.8 25.3 37.6
2015-01-05 34.9 24.9 35.2 27.1 27.2 26.6 33.6 28.1 31.4 29.6 ... 31.6 28.9 38.8 34.9 29.5 25.7 27.0 31.7 23.1 38.3

5 rows × 49 columns

Perhaps we want to look at the maximum and minimum temperatures in each location, together. We can reshape the data to support this4.

df_long = df2015.melt(
    id_vars=['Date', 'Location'],
    value_vars=['MaxTemp', 'MinTemp'],
    var_name='Variable',
    value_name='Value'
)

df_long.head()
Date Location Variable Value
0 2015-01-01 Albury MaxTemp 33.5
1 2015-01-02 Albury MaxTemp 39.6
2 2015-01-03 Albury MaxTemp 38.3
3 2015-01-04 Albury MaxTemp 33.1
4 2015-01-05 Albury MaxTemp 35.2
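
This long format is particularly handy for visualisation. As a quick sketch, using the seaborn and matplotlib packages we imported earlier (Albury is an arbitrary choice of location):

# plot the max and min temperatures for a single location
sns.lineplot(data=df_long[df_long['Location'] == 'Albury'], x='Date', y='Value', hue='Variable')
plt.show()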

Exercises

Some of these questions are easily answered by scrolling up and finding the answer in the output of the above code; however, the goal is to find the answer using code. No one actually cares what the answer to any of these questions is; it’s the process that matters!

Remember, if you don’t know the answer, it’s okay to Google it (or speak to others, including me, for help)!

Import Data (to Reset)
# import the dataset
df = pd.read_csv('data/weatherAUS.csv')
  1. What is the ‘Sunshine’ column’s data type?
# What is the 'Sunshine' column's data type?
print(df['Sunshine'].dtypes)
float64
  2. Identify all the columns that are of dtype ‘object’.
# Identify all the columns that are of dtype 'object'
print(list(df.select_dtypes(include=['object'])))
['Date', 'Location', 'WindGustDir', 'WindDir9am', 'WindDir3pm', 'RainToday', 'RainTomorrow']
  3. How many of the dataframe’s columns are of dtype ‘object’?
# How many of the dataframe's columns are of dtype 'object'?
len(list(df.select_dtypes(include=['object'])))
7
  4. How many of the ‘Rainfall’ column values are NAs?
# How many of the 'Rainfall' column values are NAs?
print(df['Rainfall'].isna().sum())
3261
  5. Create a new dataframe which only includes the ‘Date’, ‘Location’, ‘Sunshine’, ‘Rainfall’, and ‘RainTomorrow’ columns.
new_df = df[['Date', 'Location', 'Sunshine', 'Rainfall', 'RainTomorrow']]
new_df.head()
Date Location Sunshine Rainfall RainTomorrow
0 2008-12-01 Albury NaN 0.6 No
1 2008-12-02 Albury NaN 0.0 No
2 2008-12-03 Albury NaN 0.0 No
3 2008-12-04 Albury NaN 0.0 No
4 2008-12-05 Albury NaN 1.0 No
  6. Convert ‘RainTomorrow’ to a numeric variable, where ‘Yes’ = 1 and ‘No’ = 0.
# df['Location'].astype('category').cat.codes
# df['RainTomorrow'].astype('category').cat.codes
df['RainTomorrow'].map({'Yes': 1, 'No': 0})
0         0.0
1         0.0
2         0.0
3         0.0
4         0.0
         ... 
145455    0.0
145456    0.0
145457    0.0
145458    0.0
145459    NaN
Name: RainTomorrow, Length: 145460, dtype: float64
  7. What is the average amount of rainfall for each location?
# average rainfall by location, sorted by value
df.groupby('Location')['Rainfall'].mean().sort_values(ascending=False)
Location
Cairns              5.742035
Darwin              5.092452
CoffsHarbour        5.061497
GoldCoast           3.769396
Wollongong          3.594903
Williamtown         3.591108
Townsville          3.485592
NorahHead           3.387299
Sydney              3.324543
MountGinini         3.292260
Katherine           3.201090
Newcastle           3.183892
Brisbane            3.144891
NorfolkIsland       3.127665
SydneyAirport       3.009917
Walpole             2.906846
Witchcliffe         2.895664
Portland            2.530374
Albany              2.263859
BadgerysCreek       2.193101
Penrith             2.175304
Tuggeranong         2.164043
Dartmoor            2.146567
Richmond            2.138462
MountGambier        2.087562
Launceston          2.011988
Albury              1.914115
Perth               1.906295
Melbourne           1.870062
Watsonia            1.860820
PerthAirport        1.761648
Canberra            1.741720
Ballarat            1.740026
WaggaWagga          1.709946
PearceRAAF          1.669080
Moree               1.630203
Bendigo             1.619380
Hobart              1.601819
Adelaide            1.566354
Sale                1.510167
MelbourneAirport    1.451977
Nuriootpa           1.390343
Cobar               1.127309
SalmonGums          1.034382
Mildura             0.945062
Nhil                0.934863
AliceSprings        0.882850
Uluru               0.784363
Woomera             0.490405
Name: Rainfall, dtype: float64
  8. What is the average amount of rainfall for days that it will rain tomorrow?
# average rainfall depending on whether it will rain tomorrow or not
df.groupby('RainTomorrow')['Rainfall'].mean()
RainTomorrow
No     1.270290
Yes    6.142104
Name: Rainfall, dtype: float64
  9. What is the average amount of sunshine in Perth when it will not rain tomorrow?
# average sunshine in Perth when it won't rain tomorrow
df.loc[(df['Location'] == 'Perth') & (df['RainTomorrow'] == 'No'), 'Sunshine'].mean()
# df[(df['Location']=='Perth') & (df['RainTomorrow']=='No')]['Sunshine'].mean()
np.float64(9.705306603773584)
  10. We want to understand the role that time plays in the dataset. Using the original dataframe, carry out the following tasks and answer the corresponding questions:
    • Create columns representing the year and month from the ‘Date’ column. How many years of data are in the dataset?
    • Examine the distribution of the ‘Sunshine’ NAs over time. Is time a component in the ‘Sunshine’ data quality issues?
    • Calculate the average rainfall and sunshine by month. How do rainfall and sunshine vary through the year?
    • Calculate the average rainfall and sunshine by year. How have rainfall and sunshine changed over time?
# get year and month columns
df = (
    df.assign(Date=pd.to_datetime(df['Date']))
    .assign(
        Year=lambda x: x['Date'].dt.year,
        Month=lambda x: x['Date'].dt.month
    )
)

# count unique years
df['Year'].nunique()
11
# lambda function counting nulls by year
df.groupby('Year')['Sunshine'].apply(lambda x: x.isna().sum())
Year
2007        0
2008      323
2009     6146
2010     6220
2011     6053
2012     6539
2013     7570
2014     9157
2015     9441
2016    11994
2017     6392
Name: Sunshine, dtype: int64
# rainfall and sunshine by month
df.groupby('Month')[['Rainfall', 'Sunshine']].mean().round(1)
Rainfall Sunshine
Month
1 2.7 9.2
2 3.2 8.6
3 2.8 7.6
4 2.3 7.1
5 2.0 6.3
6 2.8 5.6
7 2.2 6.1
8 2.0 7.1
9 1.9 7.7
10 1.6 8.5
11 2.3 8.7
12 2.5 9.0
# rainfall and sunshine by year
df.groupby('Year')[['Rainfall', 'Sunshine']].mean().round(1)
Rainfall Sunshine
Year
2007 3.2 8.1
2008 2.3 7.8
2009 2.2 7.9
2010 2.7 7.3
2011 2.8 7.3
2012 2.4 7.6
2013 2.3 7.7
2014 2.0 7.8
2015 2.2 7.7
2016 2.4 7.6
2017 2.5 7.7

Footnotes

  1. For more information about pandas data types, check out the pandas documentation on dtypes.↩︎

  2. For more information, I’d recommend the pandas documentation, and this pandas tutorial on subsetting data.↩︎

  3. Some functions need to be wrapped in print() in order to return a value that is easy to read, while others don’t. There is an internal logic to which is which, but it’s not of huge importance to us. You are better off just testing functions out and wrapping them in print() if necessary.↩︎

  4. This is often very useful when we need to visualise data. For example, plotting the max and min temp for each location is easier if the values are organised in the same column and differentiated using another column.↩︎