# install necessary packages
# !uv add skimpy
# import packages
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from skimpy import skim
Exploring Data Using Pandas
This is the first of four sessions looking at how to explore data in Python. This session will focus on introducing the Python library, pandas. We will use pandas to import, inspect, summarise, and transform the data, illustrating a typical exploratory data analysis workflow.
We are using Australian weather data, taken from Kaggle. This dataset is used to build machine learning models that predict whether it will rain tomorrow, using daily weather observations from 2007 to 2017. To download the data, click here.
# import the dataset
df = pd.read_csv('data/weatherAUS.csv')
Setting the Scene
Before we start to explore any dataset, we need to establish what we are looking to do with the data. This should inform our decisions with any exploration, and any analysis that follows.
Questions:
- What are we trying to achieve?
- How do our goals impact our analysis?
- What should we take into consideration before we write any code?
- What sort of questions might we be interested in with this dataset?
What Our Data Can Tell Us (And What It Can’t)
We also need to consider what the data is and where it came from.
Questions:
- How was the data collected?
- What is it missing?
- What do the variables in our dataset actually mean, and are they a good approximation of the concepts we are interested in?
Exploring the Dataset
First, we should start with dataset-wide operations.
Questions:
- What do we want to know about a dataset when we first encounter it?
- How do we get a quick overview of the data that can help us in our next steps?
- We need to get a “feel” for the data before we can really make any decisions about how to analyse it. How do we get there with a new dataset?
We can start by getting a quick glance at the data. The starting point when you have just imported a new dataset is usually the pandas function pd.DataFrame.head(), which shows the top n rows of the dataset (by default it shows the top five rows).
# view the top five rows
df.head()
Date | Location | MinTemp | MaxTemp | Rainfall | Evaporation | Sunshine | WindGustDir | WindGustSpeed | WindDir9am | ... | Humidity9am | Humidity3pm | Pressure9am | Pressure3pm | Cloud9am | Cloud3pm | Temp9am | Temp3pm | RainToday | RainTomorrow | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2008-12-01 | Albury | 13.4 | 22.9 | 0.6 | NaN | NaN | W | 44.0 | W | ... | 71.0 | 22.0 | 1007.7 | 1007.1 | 8.0 | NaN | 16.9 | 21.8 | No | No |
1 | 2008-12-02 | Albury | 7.4 | 25.1 | 0.0 | NaN | NaN | WNW | 44.0 | NNW | ... | 44.0 | 25.0 | 1010.6 | 1007.8 | NaN | NaN | 17.2 | 24.3 | No | No |
2 | 2008-12-03 | Albury | 12.9 | 25.7 | 0.0 | NaN | NaN | WSW | 46.0 | W | ... | 38.0 | 30.0 | 1007.6 | 1008.7 | NaN | 2.0 | 21.0 | 23.2 | No | No |
3 | 2008-12-04 | Albury | 9.2 | 28.0 | 0.0 | NaN | NaN | NE | 24.0 | SE | ... | 45.0 | 16.0 | 1017.6 | 1012.8 | NaN | NaN | 18.1 | 26.5 | No | No |
4 | 2008-12-05 | Albury | 17.5 | 32.3 | 1.0 | NaN | NaN | W | 41.0 | ENE | ... | 82.0 | 33.0 | 1010.8 | 1006.0 | 7.0 | 8.0 | 17.8 | 29.7 | No | No |
5 rows × 23 columns
You can also look at the bottom rows of the dataset, using pd.DataFrame.tail(). This might be useful if you are dealing with time-series data. Below, we specify that we want to look at the bottom ten rows.
# view the bottom ten rows
df.tail(10)
Date | Location | MinTemp | MaxTemp | Rainfall | Evaporation | Sunshine | WindGustDir | WindGustSpeed | WindDir9am | ... | Humidity9am | Humidity3pm | Pressure9am | Pressure3pm | Cloud9am | Cloud3pm | Temp9am | Temp3pm | RainToday | RainTomorrow | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
145450 | 2017-06-16 | Uluru | 5.2 | 24.3 | 0.0 | NaN | NaN | E | 24.0 | SE | ... | 53.0 | 24.0 | 1023.8 | 1020.0 | NaN | NaN | 12.3 | 23.3 | No | No |
145451 | 2017-06-17 | Uluru | 6.4 | 23.4 | 0.0 | NaN | NaN | ESE | 31.0 | S | ... | 53.0 | 25.0 | 1025.8 | 1023.0 | NaN | NaN | 11.2 | 23.1 | No | No |
145452 | 2017-06-18 | Uluru | 8.0 | 20.7 | 0.0 | NaN | NaN | ESE | 41.0 | SE | ... | 56.0 | 32.0 | 1028.1 | 1024.3 | NaN | 7.0 | 11.6 | 20.0 | No | No |
145453 | 2017-06-19 | Uluru | 7.4 | 20.6 | 0.0 | NaN | NaN | E | 35.0 | ESE | ... | 63.0 | 33.0 | 1027.2 | 1023.3 | NaN | NaN | 11.0 | 20.3 | No | No |
145454 | 2017-06-20 | Uluru | 3.5 | 21.8 | 0.0 | NaN | NaN | E | 31.0 | ESE | ... | 59.0 | 27.0 | 1024.7 | 1021.2 | NaN | NaN | 9.4 | 20.9 | No | No |
145455 | 2017-06-21 | Uluru | 2.8 | 23.4 | 0.0 | NaN | NaN | E | 31.0 | SE | ... | 51.0 | 24.0 | 1024.6 | 1020.3 | NaN | NaN | 10.1 | 22.4 | No | No |
145456 | 2017-06-22 | Uluru | 3.6 | 25.3 | 0.0 | NaN | NaN | NNW | 22.0 | SE | ... | 56.0 | 21.0 | 1023.5 | 1019.1 | NaN | NaN | 10.9 | 24.5 | No | No |
145457 | 2017-06-23 | Uluru | 5.4 | 26.9 | 0.0 | NaN | NaN | N | 37.0 | SE | ... | 53.0 | 24.0 | 1021.0 | 1016.8 | NaN | NaN | 12.5 | 26.1 | No | No |
145458 | 2017-06-24 | Uluru | 7.8 | 27.0 | 0.0 | NaN | NaN | SE | 28.0 | SSE | ... | 51.0 | 24.0 | 1019.4 | 1016.5 | 3.0 | 2.0 | 15.1 | 26.0 | No | No |
145459 | 2017-06-25 | Uluru | 14.9 | NaN | 0.0 | NaN | NaN | NaN | NaN | ESE | ... | 62.0 | 36.0 | 1020.2 | 1017.9 | 8.0 | 8.0 | 15.0 | 20.9 | No | NaN |
10 rows × 23 columns
A quick glimpse at the data is useful, but we may also want quick descriptions of several aspects of the data, such as the length of the dataset, which tells us how many observations we have. We can get this with the built-in len() function, which also works on various other Python objects.
# get the object length
len(df)
145460
Another option is pd.DataFrame.shape, which shows the length (number of rows) and width (number of columns).
# get the object shape (number of rows, number of columns)
df.shape
(145460, 23)
Speaking of columns, if we want a quick list of the column names, we can get this using pd.DataFrame.columns.
# get all column names
df.columns
Index(['Date', 'Location', 'MinTemp', 'MaxTemp', 'Rainfall', 'Evaporation',
'Sunshine', 'WindGustDir', 'WindGustSpeed', 'WindDir9am', 'WindDir3pm',
'WindSpeed9am', 'WindSpeed3pm', 'Humidity9am', 'Humidity3pm',
'Pressure9am', 'Pressure3pm', 'Cloud9am', 'Cloud3pm', 'Temp9am',
'Temp3pm', 'RainToday', 'RainTomorrow'],
dtype='object')
A quick and easy way to get some valuable information about the dataset is pd.DataFrame.info(), which includes the total non-null observations and the data type[1] of each column.
# get dataframe info (column indices, non-null counts, data types)
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 145460 entries, 0 to 145459
Data columns (total 23 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Date 145460 non-null object
1 Location 145460 non-null object
2 MinTemp 143975 non-null float64
3 MaxTemp 144199 non-null float64
4 Rainfall 142199 non-null float64
5 Evaporation 82670 non-null float64
6 Sunshine 75625 non-null float64
7 WindGustDir 135134 non-null object
8 WindGustSpeed 135197 non-null float64
9 WindDir9am 134894 non-null object
10 WindDir3pm 141232 non-null object
11 WindSpeed9am 143693 non-null float64
12 WindSpeed3pm 142398 non-null float64
13 Humidity9am 142806 non-null float64
14 Humidity3pm 140953 non-null float64
15 Pressure9am 130395 non-null float64
16 Pressure3pm 130432 non-null float64
17 Cloud9am 89572 non-null float64
18 Cloud3pm 86102 non-null float64
19 Temp9am 143693 non-null float64
20 Temp3pm 141851 non-null float64
21 RainToday 142199 non-null object
22 RainTomorrow 142193 non-null object
dtypes: float64(16), object(7)
memory usage: 25.5+ MB
If we wanted to get a better sense of the null values in each column, we could calculate the percentage of null values by capturing whether each value in each column is null (pd.DataFrame.isnull()), summing the null values in each column (pd.DataFrame.sum()), and then dividing by the length of the dataframe (len()).
# calculate the percentage of null values in each column
df.isnull().sum() / len(df)
Date 0.000000
Location 0.000000
MinTemp 0.010209
MaxTemp 0.008669
Rainfall 0.022419
Evaporation 0.431665
Sunshine 0.480098
WindGustDir 0.070989
WindGustSpeed 0.070555
WindDir9am 0.072639
WindDir3pm 0.029066
WindSpeed9am 0.012148
WindSpeed3pm 0.021050
Humidity9am 0.018246
Humidity3pm 0.030984
Pressure9am 0.103568
Pressure3pm 0.103314
Cloud9am 0.384216
Cloud3pm 0.408071
Temp9am 0.012148
Temp3pm 0.024811
RainToday 0.022419
RainTomorrow 0.022460
dtype: float64
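With this many columns, it can also help to sort these percentages so the worst-affected columns appear first. A minimal sketch, chaining pd.Series.sort_values() onto the same calculation:

# sort the null percentages, highest first
(df.isnull().sum() / len(df)).sort_values(ascending=False)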
If we want a quick summary of all the numeric columns in the dataset, we can use pd.DataFrame.describe().
# quick summary of numeric variables
df.describe()
MinTemp | MaxTemp | Rainfall | Evaporation | Sunshine | WindGustSpeed | WindSpeed9am | WindSpeed3pm | Humidity9am | Humidity3pm | Pressure9am | Pressure3pm | Cloud9am | Cloud3pm | Temp9am | Temp3pm | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 143975.000000 | 144199.000000 | 142199.000000 | 82670.000000 | 75625.000000 | 135197.000000 | 143693.000000 | 142398.000000 | 142806.000000 | 140953.000000 | 130395.00000 | 130432.000000 | 89572.000000 | 86102.000000 | 143693.000000 | 141851.00000 |
mean | 12.194034 | 23.221348 | 2.360918 | 5.468232 | 7.611178 | 40.035230 | 14.043426 | 18.662657 | 68.880831 | 51.539116 | 1017.64994 | 1015.255889 | 4.447461 | 4.509930 | 16.990631 | 21.68339 |
std | 6.398495 | 7.119049 | 8.478060 | 4.193704 | 3.785483 | 13.607062 | 8.915375 | 8.809800 | 19.029164 | 20.795902 | 7.10653 | 7.037414 | 2.887159 | 2.720357 | 6.488753 | 6.93665 |
min | -8.500000 | -4.800000 | 0.000000 | 0.000000 | 0.000000 | 6.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 980.50000 | 977.100000 | 0.000000 | 0.000000 | -7.200000 | -5.40000 |
25% | 7.600000 | 17.900000 | 0.000000 | 2.600000 | 4.800000 | 31.000000 | 7.000000 | 13.000000 | 57.000000 | 37.000000 | 1012.90000 | 1010.400000 | 1.000000 | 2.000000 | 12.300000 | 16.60000 |
50% | 12.000000 | 22.600000 | 0.000000 | 4.800000 | 8.400000 | 39.000000 | 13.000000 | 19.000000 | 70.000000 | 52.000000 | 1017.60000 | 1015.200000 | 5.000000 | 5.000000 | 16.700000 | 21.10000 |
75% | 16.900000 | 28.200000 | 0.800000 | 7.400000 | 10.600000 | 48.000000 | 19.000000 | 24.000000 | 83.000000 | 66.000000 | 1022.40000 | 1020.000000 | 7.000000 | 7.000000 | 21.600000 | 26.40000 |
max | 33.900000 | 48.100000 | 371.000000 | 145.000000 | 14.500000 | 135.000000 | 130.000000 | 87.000000 | 100.000000 | 100.000000 | 1041.00000 | 1039.600000 | 9.000000 | 9.000000 | 40.200000 | 46.70000 |
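By default, describe() only summarises the numeric columns, but it can summarise the text columns too. A small sketch using the include argument:

# summarise the object (text) columns instead
df.describe(include='object')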
However, I prefer to bring in another package, skimpy, that does all of this very quickly and cleanly. We can get a detailed description of the entire dataset using skim().
# a more informative summary function from the skimpy package
skim(df)
(skimpy summary output: 145460 rows × 23 columns; 16 float64 and 7 string columns. For each numeric column it reports the NA count and percentage, mean, standard deviation, percentiles from p0 to p100, and an inline histogram; for each string column it reports NA counts, the shortest and longest values, and word counts.)
Exploring Variables (Columns) & Observations (Rows)
If we are going to narrow our focus to specific variables or groups of observations, we need to know how to select columns, filter values, and group the data. There are lots of different ways we can slice up the data. We won’t cover all of them here[2], but we will try to cover a range that helps illustrate how pandas works and will help you build the intuition for working with data in pandas.
We can select columns in a variety of ways, but the “correct” way to select columns in most circumstances is using selection brackets (the square brackets []), also known as the indexing operator.
# selecting a single column by name
df['Date']
# alternative ways to select columns
# df.loc[:, 'Date']
# df.Date
0 2008-12-01
1 2008-12-02
2 2008-12-03
3 2008-12-04
4 2008-12-05
...
145455 2017-06-21
145456 2017-06-22
145457 2017-06-23
145458 2017-06-24
145459 2017-06-25
Name: Date, Length: 145460, dtype: object
If we want to select multiple columns, we can use double square brackets ([[ ]]). This is the same process as before, but the inner brackets define a list, and the outer brackets are the selection brackets.
# selecting multiple columns (and all rows) by name
df[['Date', 'Location', 'Rainfall']]
# df.loc[:, ['Date', 'Location', 'Rainfall']]
Date | Location | Rainfall | |
---|---|---|---|
0 | 2008-12-01 | Albury | 0.6 |
1 | 2008-12-02 | Albury | 0.0 |
2 | 2008-12-03 | Albury | 0.0 |
3 | 2008-12-04 | Albury | 0.0 |
4 | 2008-12-05 | Albury | 1.0 |
... | ... | ... | ... |
145455 | 2017-06-21 | Uluru | 0.0 |
145456 | 2017-06-22 | Uluru | 0.0 |
145457 | 2017-06-23 | Uluru | 0.0 |
145458 | 2017-06-24 | Uluru | 0.0 |
145459 | 2017-06-25 | Uluru | 0.0 |
145460 rows × 3 columns
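Because the inner brackets just define a list, we can also build the list first and store it in a variable, which keeps things readable when selecting many columns. A minimal sketch (the cols name is purely illustrative):

# store the column list in a variable, then pass it to the selection brackets
cols = ['Date', 'Location', 'Rainfall']
df[cols]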
While selection brackets are a quick and easy solution if we want to grab a subset of variables in the dataset, they are really only intended for simple operations using only column selection.
For row selection, we should use pd.DataFrame.iloc[]. The iloc indexer is used for “integer position” selection, which means you can select rows or columns using their integer position. For rows 10-15, you can select them using the following:
# slicing by rows
df.iloc[10:16]
Date | Location | MinTemp | MaxTemp | Rainfall | Evaporation | Sunshine | WindGustDir | WindGustSpeed | WindDir9am | ... | Humidity9am | Humidity3pm | Pressure9am | Pressure3pm | Cloud9am | Cloud3pm | Temp9am | Temp3pm | RainToday | RainTomorrow | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
10 | 2008-12-11 | Albury | 13.4 | 30.4 | 0.0 | NaN | NaN | N | 30.0 | SSE | ... | 48.0 | 22.0 | 1011.8 | 1008.7 | NaN | NaN | 20.4 | 28.8 | No | Yes |
11 | 2008-12-12 | Albury | 15.9 | 21.7 | 2.2 | NaN | NaN | NNE | 31.0 | NE | ... | 89.0 | 91.0 | 1010.5 | 1004.2 | 8.0 | 8.0 | 15.9 | 17.0 | Yes | Yes |
12 | 2008-12-13 | Albury | 15.9 | 18.6 | 15.6 | NaN | NaN | W | 61.0 | NNW | ... | 76.0 | 93.0 | 994.3 | 993.0 | 8.0 | 8.0 | 17.4 | 15.8 | Yes | Yes |
13 | 2008-12-14 | Albury | 12.6 | 21.0 | 3.6 | NaN | NaN | SW | 44.0 | W | ... | 65.0 | 43.0 | 1001.2 | 1001.8 | NaN | 7.0 | 15.8 | 19.8 | Yes | No |
14 | 2008-12-15 | Albury | 8.4 | 24.6 | 0.0 | NaN | NaN | NaN | NaN | S | ... | 57.0 | 32.0 | 1009.7 | 1008.7 | NaN | NaN | 15.9 | 23.5 | No | NaN |
15 | 2008-12-16 | Albury | 9.8 | 27.7 | NaN | NaN | NaN | WNW | 50.0 | NaN | ... | 50.0 | 28.0 | 1013.4 | 1010.3 | 0.0 | NaN | 17.3 | 26.2 | NaN | No |
6 rows × 23 columns
We can do something similar using a column’s integer position, but we have to select all rows (:) first:
# using iloc with columns
df.iloc[:, 20]
0 21.8
1 24.3
2 23.2
3 26.5
4 29.7
...
145455 22.4
145456 24.5
145457 26.1
145458 26.0
145459 20.9
Name: Temp3pm, Length: 145460, dtype: float64
Finally, we can put both together to take a subset of both rows and columns:
# using iloc with rows and columns
df.iloc[10:16, 20]
10 28.8
11 17.0
12 15.8
13 19.8
14 23.5
15 26.2
Name: Temp3pm, dtype: float64
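iloc also accepts lists of integer positions, not just slices, so we can pick out non-adjacent columns in one go. A small sketch:

# rows 10-15 and the Date, Location, and Temp3pm columns (positions 0, 1, and 20)
df.iloc[10:16, [0, 1, 20]]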
However, selecting by integer position is relatively limited. It is more likely we would want to subset the data based on the values of certain columns. We can filter rows by condition using pd.DataFrame.loc[]. The loc indexer slices by label, instead of integer position.
For example, we might want to look at a subset of the data based on location.
# select all observations in Perth
df.loc[df['Location'] == 'Perth']
Date | Location | MinTemp | MaxTemp | Rainfall | Evaporation | Sunshine | WindGustDir | WindGustSpeed | WindDir9am | ... | Humidity9am | Humidity3pm | Pressure9am | Pressure3pm | Cloud9am | Cloud3pm | Temp9am | Temp3pm | RainToday | RainTomorrow | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
120638 | 2008-07-01 | Perth | 2.7 | 18.8 | 0.0 | 0.8 | 9.1 | ENE | 20.0 | NaN | ... | 97.0 | 53.0 | 1027.6 | 1024.5 | 2.0 | 3.0 | 8.5 | 18.1 | No | No |
120639 | 2008-07-02 | Perth | 6.4 | 20.7 | 0.0 | 1.8 | 7.0 | NE | 22.0 | ESE | ... | 80.0 | 39.0 | 1024.1 | 1019.0 | 0.0 | 6.0 | 11.1 | 19.7 | No | No |
120640 | 2008-07-03 | Perth | 6.5 | 19.9 | 0.4 | 2.2 | 7.3 | NE | 31.0 | NaN | ... | 84.0 | 71.0 | 1016.8 | 1015.6 | 1.0 | 3.0 | 12.1 | 17.7 | No | Yes |
120641 | 2008-07-04 | Perth | 9.5 | 19.2 | 1.8 | 1.2 | 4.7 | W | 26.0 | NNE | ... | 93.0 | 73.0 | 1019.3 | 1018.4 | 6.0 | 6.0 | 13.2 | 17.7 | Yes | Yes |
120642 | 2008-07-05 | Perth | 9.5 | 16.4 | 1.8 | 1.4 | 4.9 | WSW | 44.0 | W | ... | 69.0 | 57.0 | 1020.4 | 1022.1 | 7.0 | 5.0 | 15.9 | 16.0 | Yes | Yes |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
123826 | 2017-06-21 | Perth | 10.3 | 19.9 | 0.2 | 1.8 | 7.5 | NW | 37.0 | NNE | ... | 89.0 | 60.0 | 1017.1 | 1013.8 | 5.0 | 6.0 | 13.0 | 18.5 | No | Yes |
123827 | 2017-06-22 | Perth | 13.0 | 16.8 | 61.2 | 3.6 | 0.0 | SSW | 46.0 | W | ... | 90.0 | 75.0 | 1005.6 | 1008.9 | 7.0 | 7.0 | 16.4 | 15.6 | Yes | No |
123828 | 2017-06-23 | Perth | 13.3 | 18.9 | 0.4 | 1.8 | 6.5 | SE | 37.0 | SE | ... | 85.0 | 65.0 | 1019.2 | 1019.4 | 6.0 | 6.0 | 15.1 | 18.0 | No | No |
123829 | 2017-06-24 | Perth | 11.5 | 18.2 | 0.0 | 3.8 | 9.3 | SE | 30.0 | ESE | ... | 62.0 | 47.0 | 1025.9 | 1023.4 | 1.0 | 3.0 | 14.0 | 17.6 | No | No |
123830 | 2017-06-25 | Perth | 6.3 | 17.0 | 0.0 | 1.6 | 7.9 | E | 26.0 | SE | ... | 75.0 | 49.0 | 1028.6 | 1026.0 | 1.0 | 3.0 | 11.5 | 15.6 | No | No |
3193 rows × 23 columns
We can also filter by multiple values, such as location and rainfall.
df.loc[(df['Rainfall'] == 0) & (df['Location'] == 'Perth')]
Date | Location | MinTemp | MaxTemp | Rainfall | Evaporation | Sunshine | WindGustDir | WindGustSpeed | WindDir9am | ... | Humidity9am | Humidity3pm | Pressure9am | Pressure3pm | Cloud9am | Cloud3pm | Temp9am | Temp3pm | RainToday | RainTomorrow | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
120638 | 2008-07-01 | Perth | 2.7 | 18.8 | 0.0 | 0.8 | 9.1 | ENE | 20.0 | NaN | ... | 97.0 | 53.0 | 1027.6 | 1024.5 | 2.0 | 3.0 | 8.5 | 18.1 | No | No |
120639 | 2008-07-02 | Perth | 6.4 | 20.7 | 0.0 | 1.8 | 7.0 | NE | 22.0 | ESE | ... | 80.0 | 39.0 | 1024.1 | 1019.0 | 0.0 | 6.0 | 11.1 | 19.7 | No | No |
120644 | 2008-07-07 | Perth | 0.7 | 18.3 | 0.0 | 0.8 | 9.3 | N | 37.0 | NE | ... | 72.0 | 36.0 | 1028.9 | 1024.2 | 1.0 | 5.0 | 8.7 | 17.9 | No | No |
120645 | 2008-07-08 | Perth | 3.2 | 20.4 | 0.0 | 1.4 | 6.9 | NNW | 24.0 | NE | ... | 58.0 | 42.0 | 1023.9 | 1021.1 | 6.0 | 5.0 | 10.2 | 19.3 | No | Yes |
120651 | 2008-07-14 | Perth | 7.9 | 19.7 | 0.0 | 0.2 | 6.5 | NE | 31.0 | NE | ... | 86.0 | 41.0 | 1026.0 | 1021.9 | 6.0 | 5.0 | 11.7 | 18.7 | No | No |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
123823 | 2017-06-18 | Perth | 7.5 | 23.4 | 0.0 | 1.8 | 9.2 | NNE | 28.0 | ENE | ... | 67.0 | 41.0 | 1026.9 | 1022.9 | 0.0 | 0.0 | 14.2 | 22.2 | No | No |
123824 | 2017-06-19 | Perth | 5.5 | 23.0 | 0.0 | 3.0 | 9.1 | SW | 19.0 | ENE | ... | 84.0 | 55.0 | 1023.0 | 1020.3 | 1.0 | 2.0 | 11.5 | 22.0 | No | No |
123825 | 2017-06-20 | Perth | 7.8 | 22.5 | 0.0 | 2.8 | 9.1 | NW | 26.0 | W | ... | 98.0 | 59.0 | 1019.3 | 1015.9 | 1.0 | 1.0 | 13.5 | 21.6 | No | No |
123829 | 2017-06-24 | Perth | 11.5 | 18.2 | 0.0 | 3.8 | 9.3 | SE | 30.0 | ESE | ... | 62.0 | 47.0 | 1025.9 | 1023.4 | 1.0 | 3.0 | 14.0 | 17.6 | No | No |
123830 | 2017-06-25 | Perth | 6.3 | 17.0 | 0.0 | 1.6 | 7.9 | E | 26.0 | SE | ... | 75.0 | 49.0 | 1028.6 | 1026.0 | 1.0 | 3.0 | 11.5 | 15.6 | No | No |
2293 rows × 23 columns
For any complex process for subsetting the data, including multiple conditions, pd.DataFrame.loc[] is the best bet.
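One related trick worth knowing: when filtering on several values of the same column, chaining many == conditions gets unwieldy, and pd.Series.isin() is a tidier option. A minimal sketch (the choice of locations is purely illustrative):

# select all observations in either Perth or Sydney
df.loc[df['Location'].isin(['Perth', 'Sydney'])]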
Summarising Data
Now that we know how to select the variables or observations we are interested in, we can start doing some descriptive analysis. The operations we use will depend on the questions we are trying to answer, and the possibilities will be almost endless.
Questions:
- What “functions” might we need to carry out on our data when we are exploring it?
We know that the weather data includes observations from all over the country, but we might want to check exactly how many different locations there are. We can use pd.Series.nunique() to do this.
# count unique values
df['Location'].nunique()
49
We may also be interested in the locations themselves, which may tell us more about the spatial distribution of our data. In this case, we can use pd.Series.unique().
# get unique values
df['Location'].unique()
array(['Albury', 'BadgerysCreek', 'Cobar', 'CoffsHarbour', 'Moree',
'Newcastle', 'NorahHead', 'NorfolkIsland', 'Penrith', 'Richmond',
'Sydney', 'SydneyAirport', 'WaggaWagga', 'Williamtown',
'Wollongong', 'Canberra', 'Tuggeranong', 'MountGinini', 'Ballarat',
'Bendigo', 'Sale', 'MelbourneAirport', 'Melbourne', 'Mildura',
'Nhil', 'Portland', 'Watsonia', 'Dartmoor', 'Brisbane', 'Cairns',
'GoldCoast', 'Townsville', 'Adelaide', 'MountGambier', 'Nuriootpa',
'Woomera', 'Albany', 'Witchcliffe', 'PearceRAAF', 'PerthAirport',
'Perth', 'SalmonGums', 'Walpole', 'Hobart', 'Launceston',
'AliceSprings', 'Darwin', 'Katherine', 'Uluru'], dtype=object)
Another common operation we might look to do is calculating the mean value (pd.DataFrame.mean()) of a certain variable. What is the average value of sunshine across the entire dataset?
# calculate variable mean
df['Sunshine'].mean()
np.float64(7.611177520661157)
This gives us the mean to many decimal places, and we probably don’t need to know the average sunshine hours to this level of precision. We can use the round() method to round to two decimal places.
# round mean value
df['Sunshine'].mean().round(2)
np.float64(7.61)
Many operations will return the value with information about the object’s type included. The above values are wrapped in np.float64() because pd.DataFrame.mean() uses numpy to calculate the mean value. However, if you want to strip this information out so you only see the value itself, you can use print().
.
# print mean value
print(df['Sunshine'].mean().round(2))
7.61
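If the rounding is only for display purposes, an f-string can do the rounding and printing in one step. A small sketch:

# format the mean to two decimal places when printing
print(f"{df['Sunshine'].mean():.2f}")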
While we are often interested in the mean value when we talk about averages, we might want to know the median instead (pd.DataFrame.median()).
# calculate other summary statistics
print(df['Sunshine'].median())
8.4
Another common calculation is summing values (pd.DataFrame.sum()). We can use sum() to see the total hours of sunshine in our dataset, and we can use int() to convert this value to an integer (which also means we don’t need to use print()[3]).
# calculate sum value and return an integer
int(df['Sunshine'].sum())
575595
We can also apply these summary operations on multiple variables, using the same selection logic as before (using double square brackets).
print(df[['Sunshine', 'Rainfall']].mean())
Sunshine 7.611178
Rainfall 2.360918
dtype: float64
And we can apply multiple functions, using pd.DataFrame.agg().
df['Sunshine'].agg(['mean', 'median', 'sum']).round(1)
mean 7.6
median 8.4
sum 575595.3
Name: Sunshine, dtype: float64
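agg() will also accept a dictionary, which lets us apply different functions to different columns in a single call. A minimal sketch:

# apply different summary functions to different columns
df.agg({'Sunshine': ['mean', 'median'], 'Rainfall': ['mean', 'max']}).round(1)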
The next step when exploring specific variables will often be group-level summaries. The average amount of sunshine across the whole dataset has limited utility, but the average hours of sunshine in each location allows us to compare between locations and start to understand how different variables are related to each other. If we want to do a group-level operation, we have to use pd.DataFrame.groupby().
# calculate group means
df.groupby(by='Location')['Sunshine'].mean().round(1)
Location
Adelaide 7.7
Albany 6.7
Albury NaN
AliceSprings 9.6
BadgerysCreek NaN
Ballarat NaN
Bendigo NaN
Brisbane 8.1
Cairns 7.6
Canberra 7.4
Cobar 8.7
CoffsHarbour 7.4
Dartmoor 6.5
Darwin 8.5
GoldCoast NaN
Hobart 6.6
Katherine NaN
Launceston NaN
Melbourne 6.4
MelbourneAirport 6.4
Mildura 8.5
Moree 8.9
MountGambier 6.5
MountGinini NaN
Newcastle NaN
Nhil NaN
NorahHead NaN
NorfolkIsland 7.0
Nuriootpa 7.7
PearceRAAF 8.8
Penrith NaN
Perth 8.8
PerthAirport 8.8
Portland 6.5
Richmond NaN
Sale 6.7
SalmonGums NaN
Sydney 7.2
SydneyAirport 7.2
Townsville 8.5
Tuggeranong NaN
Uluru NaN
WaggaWagga 8.2
Walpole NaN
Watsonia 6.4
Williamtown 7.2
Witchcliffe NaN
Wollongong NaN
Woomera 9.0
Name: Sunshine, dtype: float64
The groupby(by='Location') call specifies the grouping variable (location); then we select the variable we want to summarise by location (sunshine), and then we specify the operation (mean).
There are multiple locations that return NaN (Not a Number). This indicates that numpy was unable to calculate a mean value for those locations, most likely because all sunshine values for those locations are null.
We can check this using pd.DataFrame.count(), which counts the total non-null values (whereas size() counts all values, null or not).
# group by location and count non-null sunshine values
df.groupby('Location')['Sunshine'].count()
Location
Adelaide 1769
Albany 2520
Albury 0
AliceSprings 2520
BadgerysCreek 0
Ballarat 0
Bendigo 0
Brisbane 3144
Cairns 2564
Canberra 1521
Cobar 550
CoffsHarbour 1494
Dartmoor 2566
Darwin 3189
GoldCoast 0
Hobart 3179
Katherine 0
Launceston 0
Melbourne 3192
MelbourneAirport 3008
Mildura 2876
Moree 2055
MountGambier 2597
MountGinini 0
Newcastle 0
Nhil 0
NorahHead 0
NorfolkIsland 2570
Nuriootpa 2848
PearceRAAF 3004
Penrith 0
Perth 3188
PerthAirport 3004
Portland 2566
Richmond 0
Sale 1818
SalmonGums 0
Sydney 3328
SydneyAirport 2993
Townsville 2617
Tuggeranong 0
Uluru 0
WaggaWagga 2575
Walpole 0
Watsonia 3008
Williamtown 1355
Witchcliffe 0
Wollongong 0
Woomera 2007
Name: Sunshine, dtype: int64
The results show that all the locations that return NaN in our group mean calculation have zero non-null values.
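We can make this comparison explicit by aggregating with both 'count' and 'size' in one groupby. A small sketch:

# non-null counts versus total observations for each location
df.groupby('Location')['Sunshine'].agg(['count', 'size'])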
Transforming Data
Datasets are rarely perfectly clean and tidy. We often need to transform the data before we can get the most out of it.
Questions:
- What sort of transformations would help us get the most out of the analysis of the Australian weather data?
The first step with any analysis is often converting columns to the correct types. With a longitudinal (time-series) dataset, the date column is a good place to start. We can use pd.DataFrame.dtypes to check the data type, either of a single column (using the selection brackets) or of all columns in the dataset.
print(df.dtypes)
Date object
Location object
MinTemp float64
MaxTemp float64
Rainfall float64
Evaporation float64
Sunshine float64
WindGustDir object
WindGustSpeed float64
WindDir9am object
WindDir3pm object
WindSpeed9am float64
WindSpeed3pm float64
Humidity9am float64
Humidity3pm float64
Pressure9am float64
Pressure3pm float64
Cloud9am float64
Cloud3pm float64
Temp9am float64
Temp3pm float64
RainToday object
RainTomorrow object
dtype: object
All columns are stored as either object or float64. The object data type is for generic non-numeric data, but from the columns that are stored as objects, we can tell these are mostly categorical variables where the categories are represented as text. The float64 data type refers to data that is numeric and includes decimals (float64 = 64-bit floating point number).
The date column is stored as an object, but pandas can store dates as datetime64. We can convert dates using pd.to_datetime(). When transforming data, if we want to keep those transformations, we have to store those changes using =. In this case, we want to convert the date column but we don’t want to create an entirely new dataframe to handle this change, so we can overwrite the current date column by using the selection brackets to identify the column we want to apply this change to.
# convert date column to datetime
df['Date'] = pd.to_datetime(df['Date'])
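As an aside, if we know at import time which columns hold dates, read_csv() can parse them for us via its parse_dates argument. This is just an alternative sketch, not how we loaded the data above:

# parse the date column while importing (alternative to converting afterwards)
# df = pd.read_csv('data/weatherAUS.csv', parse_dates=['Date'])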
The remaining object columns can be converted to categorical, which makes them easier to work with in subsequent analyses. We can use pd.DataFrame.astype() to convert column data types.
# create a list of all object columns
object_cols = ['Location', 'WindGustDir', 'WindDir9am', 'WindDir3pm', 'RainToday', 'RainTomorrow']

# convert object columns to category
df[object_cols] = df[object_cols].astype('category')
A more efficient, though syntactically more complex, way of doing this is using lambda functions. We won’t cover lambda functions in this session (they will be discussed in detail in a future session), but below is how we can use them to convert objects to categories.
# convert object columns to category data type
df = df.apply(lambda x: x.astype('category') if x.dtype == 'object' else x)
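Another way to avoid hard-coding the column list is pd.DataFrame.select_dtypes(), which finds the object columns for us. A small sketch (assuming the object columns haven’t already been converted):

# find the remaining object columns and convert them in one step
obj_cols = df.select_dtypes(include='object').columns
df[obj_cols] = df[obj_cols].astype('category')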
Another choice we might make is to remove missing values, using pd.DataFrame.dropna() to filter the null values and keep only the non-null values. We can use this to drop all null values across the entire dataset, or we can apply it to a subset of columns, using the subset argument.
# filter observations where sunshine is NA
df.dropna(subset='Sunshine')
Date | Location | MinTemp | MaxTemp | Rainfall | Evaporation | Sunshine | WindGustDir | WindGustSpeed | WindDir9am | ... | Humidity9am | Humidity3pm | Pressure9am | Pressure3pm | Cloud9am | Cloud3pm | Temp9am | Temp3pm | RainToday | RainTomorrow | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
6049 | 2009-01-01 | Cobar | 17.9 | 35.2 | 0.0 | 12.0 | 12.3 | SSW | 48.0 | ENE | ... | 20.0 | 13.0 | 1006.3 | 1004.4 | 2.0 | 5.0 | 26.6 | 33.4 | No | No |
6050 | 2009-01-02 | Cobar | 18.4 | 28.9 | 0.0 | 14.8 | 13.0 | S | 37.0 | SSE | ... | 30.0 | 8.0 | 1012.9 | 1012.1 | 1.0 | 1.0 | 20.3 | 27.0 | No | No |
6051 | 2009-01-03 | Cobar | 15.5 | 34.1 | 0.0 | 12.6 | 13.3 | SE | 30.0 | NaN | ... | NaN | 7.0 | NaN | 1011.6 | NaN | 1.0 | NaN | 32.7 | No | No |
6052 | 2009-01-04 | Cobar | 19.4 | 37.6 | 0.0 | 10.8 | 10.6 | NNE | 46.0 | NNE | ... | 42.0 | 22.0 | 1012.3 | 1009.2 | 1.0 | 6.0 | 28.7 | 34.9 | No | No |
6053 | 2009-01-05 | Cobar | 21.9 | 38.4 | 0.0 | 11.4 | 12.2 | WNW | 31.0 | WNW | ... | 37.0 | 22.0 | 1012.7 | 1009.1 | 1.0 | 5.0 | 29.1 | 35.6 | No | No |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
142298 | 2017-06-20 | Darwin | 19.3 | 33.4 | 0.0 | 6.0 | 11.0 | ENE | 35.0 | SE | ... | 63.0 | 32.0 | 1013.9 | 1010.5 | 0.0 | 1.0 | 24.5 | 32.3 | No | No |
142299 | 2017-06-21 | Darwin | 21.2 | 32.6 | 0.0 | 7.6 | 8.6 | E | 37.0 | SE | ... | 56.0 | 28.0 | 1014.6 | 1011.2 | 7.0 | 0.0 | 24.8 | 32.0 | No | No |
142300 | 2017-06-22 | Darwin | 20.7 | 32.8 | 0.0 | 5.6 | 11.0 | E | 33.0 | E | ... | 46.0 | 23.0 | 1015.3 | 1011.8 | 0.0 | 0.0 | 24.8 | 32.1 | No | No |
142301 | 2017-06-23 | Darwin | 19.5 | 31.8 | 0.0 | 6.2 | 10.6 | ESE | 26.0 | SE | ... | 62.0 | 58.0 | 1014.9 | 1010.7 | 1.0 | 1.0 | 24.8 | 29.2 | No | No |
142302 | 2017-06-24 | Darwin | 20.2 | 31.7 | 0.0 | 5.6 | 10.7 | ENE | 30.0 | ENE | ... | 73.0 | 32.0 | 1013.9 | 1009.7 | 6.0 | 5.0 | 25.4 | 31.0 | No | No |
75625 rows × 23 columns
We haven’t stored this transformation, because filtering nulls without careful consideration is a bad idea, but it’s useful to know nonetheless.
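If, after that careful consideration, we did decide to keep the filtered data, we would just assign the result to a name (df_sunshine below is purely illustrative):

# store the filtered dataframe under a new name
# df_sunshine = df.dropna(subset='Sunshine')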
There are lots of ways we could transform the data, but the final example we will consider here is reshaping the data using pd.DataFrame.pivot(), which transforms the data from long to wide format, and pd.DataFrame.melt(), which transforms it from wide to long format.
Perhaps we want to focus on the maximum temperature per day in each location in 2015. We can use pd.Series.dt.year to get the year from the date column, and filter for the year 2015, before reshaping the data.
df2015 = df.loc[df['Date'].dt.year == 2015]
df_wide = df2015.pivot(index='Date', columns='Location', values='MaxTemp')

df_wide.head()
Location | Adelaide | Albany | Albury | AliceSprings | BadgerysCreek | Ballarat | Bendigo | Brisbane | Cairns | Canberra | ... | Townsville | Tuggeranong | Uluru | WaggaWagga | Walpole | Watsonia | Williamtown | Witchcliffe | Wollongong | Woomera |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Date | |||||||||||||||||||||
2015-01-01 | 37.0 | 21.9 | 33.5 | 40.3 | 34.7 | 27.4 | 31.2 | 31.3 | 33.7 | 32.6 | ... | 32.6 | 32.1 | 42.0 | 35.2 | 23.6 | 28.3 | 33.7 | 25.0 | 25.3 | 39.2 |
2015-01-02 | 44.1 | 21.2 | 39.6 | 41.4 | 30.5 | 38.2 | 39.8 | 30.5 | 33.7 | 35.2 | ... | 33.0 | 34.1 | 42.4 | 38.9 | 21.1 | 40.6 | 29.3 | 23.6 | 24.6 | 43.3 |
2015-01-03 | 38.2 | 21.5 | 38.3 | 36.4 | 34.3 | 37.5 | 40.3 | 28.9 | 33.6 | 34.7 | ... | 28.1 | 33.7 | 39.8 | 37.5 | 21.8 | 39.5 | 32.8 | 23.0 | 25.7 | 44.7 |
2015-01-04 | 30.5 | 23.3 | 33.1 | 29.0 | 34.8 | 23.5 | 29.0 | 30.2 | 29.4 | 32.5 | ... | 31.6 | 32.8 | 36.1 | 33.8 | 24.4 | 25.1 | 34.5 | 29.8 | 25.3 | 37.6 |
2015-01-05 | 34.9 | 24.9 | 35.2 | 27.1 | 27.2 | 26.6 | 33.6 | 28.1 | 31.4 | 29.6 | ... | 31.6 | 28.9 | 38.8 | 34.9 | 29.5 | 25.7 | 27.0 | 31.7 | 23.1 | 38.3 |
5 rows × 49 columns
Perhaps we want to look at the maximum and minimum temperatures in each location, together. We can reshape the data to support this[4].
df_long = df2015.melt(
    id_vars=['Date', 'Location'],
    value_vars=['MaxTemp', 'MinTemp'],
    var_name='Variable',
    value_name='Value'
)

df_long.head()
Date | Location | Variable | Value | |
---|---|---|---|---|
0 | 2015-01-01 | Albury | MaxTemp | 33.5 |
1 | 2015-01-02 | Albury | MaxTemp | 39.6 |
2 | 2015-01-03 | Albury | MaxTemp | 38.3 |
3 | 2015-01-04 | Albury | MaxTemp | 33.1 |
4 | 2015-01-05 | Albury | MaxTemp | 35.2 |
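One reason the long format is so handy (see footnote [4]) is plotting: seaborn can draw both temperature series from the same Value column, split by the Variable column. A minimal sketch using the seaborn and matplotlib imports from the top of the session (the choice of Albury is purely illustrative):

# plot max and min temperatures for one location from the long-format data
sns.lineplot(data=df_long[df_long['Location'] == 'Albury'], x='Date', y='Value', hue='Variable')
plt.show()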
Exercises
Some of these questions are easily answered by scrolling up and finding the answer in the output of the above code; however, the goal is to find the answer using code. No one actually cares what the answer to any of these questions is; it’s the process that matters!
Remember, if you don’t know the answer, it’s okay to Google it (or speak to others, including me, for help)!
Import Data (to Reset)
# import the dataset
df = pd.read_csv('data/weatherAUS.csv')
- What is the ‘Sunshine’ column’s data type?
# What is the 'Sunshine' column's data type?
print(df['Sunshine'].dtypes)
float64
- Identify all the columns that are of dtype ‘object’.
# Identify all the columns that are of dtype 'object'
print(list(df.select_dtypes(include=['object'])))
['Date', 'Location', 'WindGustDir', 'WindDir9am', 'WindDir3pm', 'RainToday', 'RainTomorrow']
- How many of the dataframe’s columns are of dtype ‘object’?
# How many of the dataframe's columns are of dtype 'object'?
len(list(df.select_dtypes(include=['object'])))
7
- How many of the ‘Rainfall’ column values are NAs?
# How many of the 'Rainfall' column values are NAs?
print(df['Rainfall'].isna().sum())
3261
- Create a new dataframe which only includes the ‘Date’, ‘Location’, ‘Sunshine’, ‘Rainfall’, and ‘RainTomorrow’ columns.
new_df = df[['Date', 'Location', 'Sunshine', 'Rainfall', 'RainTomorrow']]
new_df.head()
Date | Location | Sunshine | Rainfall | RainTomorrow | |
---|---|---|---|---|---|
0 | 2008-12-01 | Albury | NaN | 0.6 | No |
1 | 2008-12-02 | Albury | NaN | 0.0 | No |
2 | 2008-12-03 | Albury | NaN | 0.0 | No |
3 | 2008-12-04 | Albury | NaN | 0.0 | No |
4 | 2008-12-05 | Albury | NaN | 1.0 | No |
- Convert ‘RainTomorrow’ to a numeric variable, where ‘Yes’ = 1 and ‘No’ = 0.
# df['Location'].astype('category').cat.codes
# df['RainTomorrow'].astype('category').cat.codes
df['RainTomorrow'].map({'Yes': 1, 'No': 0})
0 0.0
1 0.0
2 0.0
3 0.0
4 0.0
...
145455 0.0
145456 0.0
145457 0.0
145458 0.0
145459 NaN
Name: RainTomorrow, Length: 145460, dtype: float64
- What is the average amount of rainfall for each location?
# average rainfall by location, sorted by value
df.groupby('Location')['Rainfall'].mean().sort_values(ascending=False)
Location
Cairns 5.742035
Darwin 5.092452
CoffsHarbour 5.061497
GoldCoast 3.769396
Wollongong 3.594903
Williamtown 3.591108
Townsville 3.485592
NorahHead 3.387299
Sydney 3.324543
MountGinini 3.292260
Katherine 3.201090
Newcastle 3.183892
Brisbane 3.144891
NorfolkIsland 3.127665
SydneyAirport 3.009917
Walpole 2.906846
Witchcliffe 2.895664
Portland 2.530374
Albany 2.263859
BadgerysCreek 2.193101
Penrith 2.175304
Tuggeranong 2.164043
Dartmoor 2.146567
Richmond 2.138462
MountGambier 2.087562
Launceston 2.011988
Albury 1.914115
Perth 1.906295
Melbourne 1.870062
Watsonia 1.860820
PerthAirport 1.761648
Canberra 1.741720
Ballarat 1.740026
WaggaWagga 1.709946
PearceRAAF 1.669080
Moree 1.630203
Bendigo 1.619380
Hobart 1.601819
Adelaide 1.566354
Sale 1.510167
MelbourneAirport 1.451977
Nuriootpa 1.390343
Cobar 1.127309
SalmonGums 1.034382
Mildura 0.945062
Nhil 0.934863
AliceSprings 0.882850
Uluru 0.784363
Woomera 0.490405
Name: Rainfall, dtype: float64
- What is the average amount of rainfall for days that it will rain tomorrow?
# average rainfall depending on whether it will rain tomorrow or not
df.groupby('RainTomorrow')['Rainfall'].mean()
RainTomorrow
No 1.270290
Yes 6.142104
Name: Rainfall, dtype: float64
- What is the average amount of sunshine in Perth when it will not rain tomorrow?
# average sunshine in Perth when it won't rain tomorrow
df.loc[(df['Location'] == 'Perth') & (df['RainTomorrow'] == 'No'), 'Sunshine'].mean()
# df[(df['Location']=='Perth') & (df['RainTomorrow']=='No')]['Sunshine'].mean()
np.float64(9.705306603773584)
- We want to understand the role that time plays in the dataset. Using the original dataframe, carry out the following tasks and answer the corresponding questions:
- Create columns representing the year and month from the ‘Date’ column. How many years of data are in the dataset?
- Examine the distribution of the ‘Sunshine’ NAs over time. Is time a component in the ‘Sunshine’ data quality issues?
- Calculate the average rainfall and sunshine by month. How do rainfall and sunshine vary through the year?
- Calculate the average rainfall and sunshine by year. How have rainfall and sunshine changed over time?
# get year and month columns
df = (
    df.assign(Date=pd.to_datetime(df['Date']))
    .assign(
        Year=lambda x: x['Date'].dt.year,
        Month=lambda x: x['Date'].dt.month
    )
)
# count unique years
df['Year'].nunique()
11
# lambda function counting nulls by year
df.groupby('Year')['Sunshine'].apply(lambda x: x.isna().sum())
Year
2007 0
2008 323
2009 6146
2010 6220
2011 6053
2012 6539
2013 7570
2014 9157
2015 9441
2016 11994
2017 6392
Name: Sunshine, dtype: int64
# rainfall and sunshine by month
df.groupby('Month')[['Rainfall', 'Sunshine']].mean().round(1)
Rainfall | Sunshine | |
---|---|---|
Month | ||
1 | 2.7 | 9.2 |
2 | 3.2 | 8.6 |
3 | 2.8 | 7.6 |
4 | 2.3 | 7.1 |
5 | 2.0 | 6.3 |
6 | 2.8 | 5.6 |
7 | 2.2 | 6.1 |
8 | 2.0 | 7.1 |
9 | 1.9 | 7.7 |
10 | 1.6 | 8.5 |
11 | 2.3 | 8.7 |
12 | 2.5 | 9.0 |
# rainfall and sunshine by year
df.groupby('Year')[['Rainfall', 'Sunshine']].mean().round(1)
Rainfall | Sunshine | |
---|---|---|
Year | ||
2007 | 3.2 | 8.1 |
2008 | 2.3 | 7.8 |
2009 | 2.2 | 7.9 |
2010 | 2.7 | 7.3 |
2011 | 2.8 | 7.3 |
2012 | 2.4 | 7.6 |
2013 | 2.3 | 7.7 |
2014 | 2.0 | 7.8 |
2015 | 2.2 | 7.7 |
2016 | 2.4 | 7.6 |
2017 | 2.5 | 7.7 |
Footnotes
[1] For more information about pandas data types, check out the pandas documentation on dtypes.↩︎
[2] For more information, I’d recommend the pandas documentation, and this pandas tutorial on subsetting data.↩︎
[3] Some functions should be wrapped in print() in order to return a value that is easy to read, but others won’t. There will be an internal logic for which is which, but it’s not of huge importance to us. You are better off just testing functions out and wrapping them in print() if necessary.↩︎
[4] This is often very useful when we need to visualise data. For example, plotting the max and min temp for each location is easier if the values are organised in the same column and differentiated using another column.↩︎