import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.metrics import accuracy_score, classification_report
from sklearn.dummy import DummyClassifier
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
np.random.seed(42)
Machine Learning Workflow
This is the final session in our machine learning series. This session shifts attention to the workflow itself, showing what an end-to-end workflow actually looks like from problem definition through to deployment.
We will use the Titanic dataset again, for familiarity, but the goal here is not the model, it is the process. We will build on the machine learning workflow we explored in the first two sessions, adding to the workflow by doing more preprocessing of the data and feature engineering, and building more robust, generalisable models using cross validation, and hyperparameter tuning. We will also look at how to build pipelines that support this process and avoid data leakage.
Slides
Use the left ⬅️ and right ➡️ arrow keys to navigate through the slides below. To view in a separate tab/window, follow this link.
An Overview of the Machine Learning Workflow
Any end-to-end machine learning project will follow a broadly similar workflow. It should include the following steps:
- Problem Definition - what are you actually trying to predict, and why?
- Data Collection - what do you have, what’s missing, what needs fixing?
- Data Preparation - transforming raw data into useful model inputs
- Model Training - selecting and training a model
- Model Validation - honestly evaluating whether it works
- Model Deployment - putting the model to use and keeping it working
What each of these steps should look like will depend on what you are trying to achieve and the data you are working with. The steps in the workflow that we will build on today are primarily focused on the data preparation and the model training and validation.
While the focus here is on building the model, it’s important to highlight that this is just a small part of the full workflow. In practice, data exploration (which is an iterative step that informs both the data collection and data preparation steps) will often take up the majority of the time spent on the project, and model deployment is perhaps the most important part of the process1. The model training step is, in reality, usually the smallest part of the project2.
Building Better Models
Setup
Data Collection & Exploration
df = sns.load_dataset('titanic')
print(f"Shape: {df.shape}")
print(f"\nSurvival Rate: {df['survived'].mean():.1%}")
print(f"\nMissing Values:\n{df.isnull().sum()[df.isnull().sum() > 0]}")
Shape: (891, 15)
Survival Rate: 38.4%
Missing Values:
age 177
embarked 2
deck 688
embark_town 2
dtype: int64
Age has a meaningful number of missing values. Deck is almost entirely missing and not worth using. Embarked has two missing values, which is negligible but needs handling.
Data Preparation
# select the columns we want to work with
df = df[['survived', 'pclass', 'sex', 'age', 'sibsp', 'parch', 'fare', 'alone', 'embark_town']].copy()
# encode sex as a binary integer
df['sex'] = (df['sex'] == 'female').astype(int)
# create dummy variables for embark town
df['embark_town'] = df['embark_town'].str.lower()
df = pd.get_dummies(df, columns=['embark_town'], prefix="embark", dtype=int)
Age still has missing values. These will be handled inside a Pipeline rather than here; the reason for this is explained below.
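As a quick illustration of what `get_dummies` produces, here is a minimal sketch using a toy frame (not the Titanic data) with the same three embarkation towns:

```python
import pandas as pd

# toy frame mirroring the embark_town encoding above
toy = pd.DataFrame({'embark_town': ['southampton', 'cherbourg', 'queenstown']})
encoded = pd.get_dummies(toy, columns=['embark_town'], prefix='embark', dtype=int)

# each row gets a 1 in exactly one of the three new indicator columns
print(encoded)
```

Each category becomes its own 0/1 column, which is what allows a numeric model to use a categorical feature.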
Train/Test Split
The split comes before any imputation or scaling. This is deliberate. Preprocessing steps must be fit only on training data, never on the full dataset.
X = df.drop(columns='survived')
y = df['survived']
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42, stratify=y
)
print(f"Training Set: {X_train.shape[0]} Rows")
print(f"Test Set: {X_test.shape[0]} Rows")
Training Set: 712 Rows
Test Set: 179 Rows
For a reminder of why it is important to split the data before training a model, check out MLU-Explain’s Train, Test, and Validation Sets article.
Data Preprocessing
We can use scikit-learn’s Pipeline to build a process for carrying out the data preprocessing and applying it to the train and test sets separately. This reduces the risk of data leaking from the test set into the model training.
preprocessor = Pipeline([
# fill missing age and fare with training medians
('imputer', SimpleImputer(strategy='median')),
# scale all features to zero mean, unit variance
('scaler', StandardScaler())
])
While we have carried out these simple steps to make the data easier to model, there are many ways we might transform the data to amplify the signal for the model training. We haven’t carried out any feature engineering here, but for ideas and inspiration, try Feature Engineering A-Z.
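To see why fitting the preprocessor only on training data matters, here is a minimal, self-contained sketch (toy arrays, not the Titanic features) showing that `transform` on the test set reuses the statistics learned from the training set:

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# toy training and test matrices, each with a missing value
X_train = np.array([[1.0, 10.0], [2.0, np.nan], [3.0, 30.0], [4.0, 40.0]])
X_test = np.array([[np.nan, 25.0]])

preprocessor = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

# fit_transform learns the medians, means, and variances from the
# training data only...
X_train_t = preprocessor.fit_transform(X_train)
# ...while transform reuses those training statistics on the test data
X_test_t = preprocessor.transform(X_test)

# the test row's missing value was filled with the TRAINING median (2.5),
# which scales to exactly 0 because 2.5 is also the training mean
print(X_test_t)
```

If we had fitted on the full dataset instead, the test row would have influenced its own imputation and scaling, which is exactly the leakage the train/test split is meant to prevent.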
Model Training
Baseline Model
A baseline estimate is essential before modelling. If 62% of passengers did not survive, a model that always predicts “did not survive” achieves 62% accuracy while being completely useless. Any model we build must outperform this to demonstrate it has learned something.
dummy_clf = DummyClassifier(strategy='most_frequent')
dummy_clf.fit(X_train, y_train)
baseline_accuracy = accuracy_score(y_test, dummy_clf.predict(X_test))
print(f"Baseline Accuracy: {baseline_accuracy:.1%}")
Baseline Accuracy: 61.5%
The baseline model doesn’t use the preprocessor pipeline created earlier because it is just identifying the most frequent outcome in the training data and predicting this is the outcome for each test observation.
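A tiny self-contained sketch (toy labels, not the Titanic data) makes the behaviour of the `most_frequent` strategy concrete: its accuracy is simply the majority class proportion.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# toy labels: 6 zeros and 4 ones, so the majority class share is 0.6
y_toy = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
X_toy = np.zeros((10, 1))  # the features are ignored by the dummy

clf = DummyClassifier(strategy='most_frequent').fit(X_toy, y_toy)
acc = accuracy_score(y_toy, clf.predict(X_toy))
print(acc)  # 0.6, the majority class proportion
```

Any real model has to beat this number before it can claim to have learned anything from the features.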
Building a Model Pipeline
We can create a pipeline that feeds the transformed data from the preprocessor pipeline we created earlier to a random forest model, in a nested pipeline.
model = Pipeline(steps=[
("preprocessor", preprocessor),
("classifier", RandomForestClassifier(random_state=42))
])
Alternatively, the simpler approach would be to create a model pipeline that includes the preprocessing steps directly.
model = Pipeline(steps=[
('imputer', SimpleImputer(strategy='median')),
('scaler', StandardScaler()),
('classifier', RandomForestClassifier(random_state=42))
])
Random forests and other tree-based models are scale invariant, which means that it is unnecessary to scale the data to improve model performance. However, we will keep this step in the pipeline for illustrative purposes (scaling can be very valuable with other models).
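A quick sketch (on synthetic data, not the Titanic features) demonstrates the scale invariance claim: because standardisation preserves the ordering of every feature, the forest finds equivalent splits either way and the predictions match.

```python
from sklearn.datasets import make_classification
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier

# synthetic data standing in for the Titanic features
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

scaled = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', RandomForestClassifier(random_state=42)),
]).fit(X, y)

unscaled = RandomForestClassifier(random_state=42).fit(X, y)

# scaling is an order-preserving transform, so with the same random_state
# both forests partition the data identically
print((scaled.predict(X) == unscaled.predict(X)).all())
```

For distance- or gradient-based models (k-nearest neighbours, logistic regression, SVMs, neural networks) this equivalence does not hold, which is why the scaler stays in the pipeline as a habit.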
Cross Validation
The first two sessions evaluated models on a single held-out test set. This is vulnerable to noise in the training data and variance between the training and test data. The solution is to use cross validation. Cross validation splits the training data into \(k\) “folds” and evaluates the model across each split. Each fold takes a turn as the test set and then the scores are averaged. No data is wasted, and the variance across folds indicates how stable the estimate is.
For a visual demonstration of what cross validation does and why it is so valuable, try MLU-Explain’s Cross Validation article.
cv_scores = cross_val_score(model, X_train, y_train, cv=5, scoring='accuracy')
print(f"Fold Accuracy Scores: {cv_scores.round(2)}")
print(f"Mean: {cv_scores.mean():.2f} (+/- {cv_scores.std():.2f})")
Fold Accuracy Scores: [0.79 0.71 0.82 0.85 0.81]
Mean: 0.80 (+/- 0.05)
The cross validated fold scores give a more reliable estimate of how the model will generalise than a single evaluation on the training data, but we can also tune the random forest model to maximise performance.
Hyperparameter Tuning
Every algorithm has settings (hyperparameters) that aren’t learned from data (e.g. max_depth, n_estimators). These need to be set before training the model. More complex algorithms, like random forests and gradient boosting models, have many hyperparameters, and their values can have a significant impact on model performance. The best approach is to tune the hyperparameters by testing multiple values and finding the combination of hyperparameter values that produces the best model performance.
GridSearchCV searches over a specified grid of values, evaluating each combination using cross validation. Because it uses cross validation internally, the selection process does not touch the held-out test set. This reduces the risk of the model overfitting.
param_grid = {
'classifier__n_estimators': [100, 300],
'classifier__max_depth': [3, 5, None],
'classifier__min_samples_leaf': [1, 5],
}
grid_search = GridSearchCV(
model,
param_grid,
cv=5,
scoring='accuracy'
)
grid_search.fit(X_train, y_train)
print(f"Best Parameters: {grid_search.best_params_}")
print(f"Best CV Accuracy: {grid_search.best_score_:.2f}")
Best Parameters: {'classifier__max_depth': 3, 'classifier__min_samples_leaf': 1, 'classifier__n_estimators': 100}
Best CV Accuracy: 0.84
Grid Search is a simple approach to hyperparameter tuning, but there are lots of different methods. For more information about hyperparameter tuning, check out Saket Jain’s overview, Hyperparameter Tuning in Machine Learning3.
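One of those other methods is available in scikit-learn itself: RandomizedSearchCV samples a fixed number of combinations from the grid (or from distributions) instead of exhausting every one, which scales much better as the number of hyperparameters grows. A self-contained sketch on synthetic data (the parameter ranges here are illustrative, not recommendations):

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=300, n_features=8, random_state=0)

model = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler()),
    ('classifier', RandomForestClassifier(random_state=42)),
])

# sample 10 random combinations instead of evaluating a full grid;
# distributions (like randint) let continuous ranges be explored
search = RandomizedSearchCV(
    model,
    param_distributions={
        'classifier__n_estimators': randint(50, 200),
        'classifier__max_depth': [3, 5, None],
        'classifier__min_samples_leaf': randint(1, 10),
    },
    n_iter=10,
    cv=5,
    scoring='accuracy',
    random_state=42,
)
search.fit(X, y)
print(search.best_params_)
```

The trade-off is coverage versus cost: random search may miss the single best grid point, but for the same compute budget it usually explores the space more effectively than an exhaustive grid.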
Model Validation
The test set has been held out through the entire process. We can now use it to evaluate the model’s performance. This gives an unbiased estimate of how the model should perform on new data.
best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)
final_accuracy = accuracy_score(y_test, y_pred)
print(f"Baseline Accuracy: {baseline_accuracy:.1%}")
print(f"Final Model Accuracy: {final_accuracy:.1%}")
print(f"Improvement Over Baseline: {final_accuracy - baseline_accuracy:+.1%}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=['Died', 'Survived']))
Baseline Accuracy: 61.5%
Final Model Accuracy: 79.3%
Improvement Over Baseline: +17.9%
Classification Report:
              precision    recall  f1-score   support

        Died       0.77      0.94      0.85       110
    Survived       0.85      0.57      0.68        69

    accuracy                           0.79       179
   macro avg       0.81      0.75      0.76       179
weighted avg       0.80      0.79      0.78       179
This series of machine learning sessions has relied on accuracy to evaluate model performance, but it’s important to remember that there are lots of different evaluation metrics, all of which serve slightly different purposes. The metric you use to evaluate your model depends, much like other choices in the machine learning workflow, on what you are trying to achieve and the data you are working with.
We have used accuracy because it is a more familiar concept for new users, but accuracy is a limited metric and it is often better to use other approaches to evaluate a classification model. The MLU-Explain visual explanation of classification metrics is a great place to start when learning more about the metrics you should choose for specific tasks4.
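As a small taste of those other metrics, here is a self-contained sketch computing precision, recall, F1, and ROC AUC directly on toy labels (illustrative values, not the Titanic predictions):

```python
from sklearn.metrics import precision_score, recall_score, f1_score, roc_auc_score

# toy true labels, predicted labels, and predicted probabilities
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 0, 1, 1, 1]
y_prob = [0.1, 0.2, 0.3, 0.6, 0.4, 0.7, 0.8, 0.9]

# precision: of the observations predicted positive, how many were right?
print(f"Precision: {precision_score(y_true, y_pred):.2f}")
# recall: of the true positives, how many did the model find?
print(f"Recall: {recall_score(y_true, y_pred):.2f}")
# F1: the harmonic mean of precision and recall
print(f"F1: {f1_score(y_true, y_pred):.2f}")
# ROC AUC: ranking quality across all thresholds; needs probabilities
print(f"ROC AUC: {roc_auc_score(y_true, y_prob):.2f}")
```

Which of these matters most depends on the costs of the two error types: a model screening for rare, serious outcomes usually prioritises recall, while one triggering expensive interventions usually prioritises precision.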
Model Deployment
A model that performs well on the held-out test set is a good start, but that doesn’t mean it is ready to be released into the wild and used in anger. Getting a model into production, where it changes decisions and has a real-world impact, requires greater scrutiny and significant thought about the pipeline that will feed the model and how the outputs will be served to users.
This is out of scope for these sessions. It is a complex topic that deserves multiple sessions of its own. Still, I wanted to highlight that the model workflow doesn’t end when you’ve created a model that performs well on the test data.
Some of the questions that you need to answer when deploying a machine learning model include:
- How will it receive data and how will it return predictions?
- Who sees the model outputs and what will they do with them?
- What happens when the model goes wrong?
- How will you detect when its performance degrades5 over time (for example, as the underlying population changes)?
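One small, concrete piece of the deployment story is persisting the fitted pipeline so the serving code can reload it. A hedged sketch using joblib (the filename and synthetic data are illustrative, not part of the sessions' code):

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, n_features=5, random_state=0)

model = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler()),
    ('classifier', RandomForestClassifier(random_state=42)),
]).fit(X, y)

# persist the whole pipeline: the preprocessing travels with the model,
# so new data is imputed and scaled exactly as the training data was
path = os.path.join(tempfile.gettempdir(), 'survival_model.joblib')
joblib.dump(model, path)

# later, in the serving code: reload and predict on incoming rows
loaded = joblib.load(path)
print(loaded.predict(X[:3]))
```

Saving the pipeline rather than the bare classifier matters in production: it guarantees that serving-time inputs pass through the same imputation and scaling that the model was trained with.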
There are lots of good resources for learning about model deployment. Christian Kästner’s Machine Learning in Production is a very thorough overview of the topic, while Anyscale’s Made with ML is a great place to start for thinking about the wider workflow all in the context of getting models into production. I would also recommend Naeem Ul Haq’s guide for Machine Learning System Design and, for anyone looking to go into greater detail, Chip Huyen’s Designing Machine Learning Systems, for starting to think about machine learning in terms of wider systems beyond the model itself.
Where Next?
This series of machine learning sessions has only scratched the surface. The goal throughout has been only to introduce the wider world of machine learning and give learners the tools to start their journey. There is a lot that hasn’t been covered here, or that has been covered only briefly.
Below are some resources that can help you get started on the next steps in that journey:
- Hands-On Machine Learning (the print edition of the book is available here) - This book is a common starting point for many when they first explore machine learning. It is a really good applied overview, though it’s worth noting that Tensorflow is less popular nowadays (PyTorch is more common).
- Approaching (Almost) Any Machine Learning Problem - This is another great starting point for exploring machine learning in an applied context.
- An Introduction to Statistical Learning - For a more theoretical, statistically-grounded approach, ISL is excellent. Patterns, Predictions, and Actions is also a good theoretical resource.
- StatQuest’s Machine Learning YouTube playlist - If you prefer to learn visually, StatQuest is an extremely thorough but accessible resource for everything statistics and machine learning. The machine learning playlist starts with the very basics and also does a great job explaining more complex topics later in the playlist.
- Machine Learning Crash Course - For an online course, Google’s crash course on machine learning is a solid resource that covers a lot.
- Interpretable Machine Learning - Finally, Christoph Molinar’s Interpretable Machine Learning is a great book for learning about machine learning outputs and thinking about producing systems that are accessible to non-experts.
These resources will help you take the next step and continue exploring machine learning, but there is no better education than just building things. The best way to gain greater understanding of machine learning and build on what we’ve covered here is to try it out for yourself. This doesn’t necessarily have to be a work project. It could be a side project instead6. But machine learning is a topic that is best learned by doing.
Footnotes
If you can’t get the model into production and use the model’s outputs, it doesn’t really matter how good the model is.↩︎
The reason so many learning resources, this session included, focus so heavily on the model training, is because, while it may not be the most intensive part of the process when you are familiar with building machine learning models, a lot of the concepts are relatively (or completely) new to learners, and it is easy to build a bad model! The exploratory phase will be familiar to a lot of analysts, so requires less attention when learning.↩︎
For an in-depth discussion about hyperparameter tuning, I would recommend Jeremy Jordan’s blog post, Hyperparameter Tuning for Machine Learning Models.↩︎
I would also recommend their post exploring ROC & AUC, which can be used to evaluate model performance across the range of outcome values, helping to find the right threshold (the probability at which observations are classified as positive classes) for your model.↩︎
All models will degrade over time, as the world and the context being modeled changes. A fundamental part of deploying the model is planning how to monitor its performance and identify when the model has degraded enough that it needs retraining.↩︎
In fact, it can often be better to learn using low-stakes side projects. When you get to make all the choices about data, infrastructure, process, and outputs, this gives you the opportunity to think about designing a system without constraints. And when the project has no real-world risks (like information governance and ethical concerns around model performance), it is easier to play around and explore what works without fear of consequences.↩︎