Fitting a Line to Data

The First Building Block in a Regression

Scatterplots Show Patterns

Fitted Lines Show Associations

Regression is Just Fitting Lines to Data

  • Regression finds the line that passes through the data in a way that minimises the overall distance between the line and all of the observations.
  • This is often referred to as the “line of best fit”.
  • The line of best fit is the best representation of the relationship between two variables given the data.

Regression Foundations

A Conceptual Introduction

Regression Estimates Effects

  • Regression quantifies the relationship between one or more explanatory variables (or predictors) and the outcome (the variable we want to analyse).
  • Correlation indicates whether two variables are related. Regression tells us by how much.
  • For example, if flipper length increases by 10mm, body mass increases by ~500g.
  • This precision opens up a world of possibilities for analysis.

Why Description is Not Enough

  • Describing the pattern is useful, but we often want more.
    • Will a penguin with a 205mm flipper be heavier than one with 195mm?
    • If we measure flipper length, can we estimate body mass?
    • Does this relationship hold for new penguins we haven’t seen?
  • Regression lets us make predictions, test hypotheses, and make inferences.

Linear Regression Fits a Straight Line

  • The simplest regression model fits a straight line through two variables, \(X\) (the predictor) and \(Y\) (the outcome).
    • The straight line indicates that the relationship between \(X\) and \(Y\) is constant.
    • Changes in the value of \(X\) have the same effect on \(Y\) across the entire range of \(X\).
  • This is linear regression. It is the most common type of regression, and the foundation for so much more.

Regression Minimises Prediction Error

  • The line of best fit minimises the vertical distance between the line (the predicted values) and the observations.
  • Linear regression uses the following method to do this:
    1. Calculate the error (residual) for each point.
    2. Square each error (so they’re all positive).
    3. Sum all squared errors, called the Residual Sum of Squares (RSS).
    4. Find the line that minimises the RSS.
  • This is called Ordinary Least Squares (OLS) estimation.
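
A minimal sketch of those four steps in Python, using NumPy and the ten flipper-length and body-mass values from the fitted-values table later in these slides (fitting to just ten points, so the coefficients will differ slightly from the full-dataset values quoted elsewhere):

```python
import numpy as np

# Ten (flipper length, body mass) pairs from the fitted-values table
x = np.array([178, 196, 195, 210, 192, 201, 197, 220, 229, 210], dtype=float)
y = np.array([3250, 3675, 4000, 4850, 4050, 4300, 3300, 5400, 5800, 4400], dtype=float)

def rss(b0, b1):
    """Steps 1-3: calculate each residual, square it, sum them all."""
    residuals = y - (b0 + b1 * x)  # step 1: error for each point
    return np.sum(residuals ** 2)  # steps 2 & 3: square and sum

# Step 4: the closed-form OLS solution is the line that minimises the RSS
b1 = np.cov(x, y)[0, 1] / np.var(x, ddof=1)
b0 = y.mean() - b1 * x.mean()
print(f"intercept = {b0:.1f}, slope = {b1:.1f}, RSS = {rss(b0, b1):,.0f}")
```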

Finding the Line of Best Fit

Comparing Different Lines

Predictions Come From the Fitted Line

  • Once you have a line fitted to the data, you can make predictions (and with a little more work, inferences).
    • The predicted outcome for a given value of \(X\) is just the \(Y\) value of the line at \(X\).
    • In regression, predictions are sometimes referred to as fitted values.
    • The difference between the predicted value from the line and the actual value is the prediction error, or the residual.
  • Inferences (statements about the effect of \(X\) on \(Y\)) are derived from the slope of the line.
    • The slope of the fitted line suggests that a 10mm increase in flipper length is associated with a ~500g increase in body mass.
    • In regression, these values are referred to as coefficients.

Reading Fitted Values & Residuals

 Flipper Length (mm)  Body Mass (g)  Fitted (g)  Residual (g)
               178.0         3250.0      3055.2         194.8
               196.0         3675.0      3957.9        -282.9
               195.0         4000.0      3907.8          92.2
               210.0         4850.0      4660.1         189.9
               192.0         4050.0      3757.3         292.7
               201.0         4300.0      4208.7          91.3
               197.0         3300.0      4008.1        -708.1
               220.0         5400.0      5161.6         238.4
               229.0         5800.0      5613.0         187.0
               210.0         4400.0      4660.1        -260.1
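
A quick sketch of how the Fitted and Residual columns are produced, using the rounded coefficients from the "Applying the Formula" slide (the table itself was computed with unrounded coefficients, so expect small differences):

```python
# Fitted value and residual for the first row of the table
b0, b1 = -5872.1, 50.2          # intercept and slope (rounded)
flipper, mass = 178.0, 3250.0   # observed values

fitted = b0 + b1 * flipper      # the prediction from the line
residual = mass - fitted        # actual minus predicted
print(fitted, residual)         # 3063.5, 186.5 (table: 3055.2, 194.8)
```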

Dispersion Around the Line is Unexplained Variance

A (Brief) Look Under the Hood

The Component Pieces of Linear Regression

The Regression Formula

  • The formula for a simple linear regression model, predicting \(Y\) with one predictor \(X\):

\[ Y = \underbrace{\overset{\text{Intercept}}{\beta_0} + \overset{\text{Slope}}{\beta_1} X}_{\text{Explained Variance}} + \overset{\text{Error}}{\underset{\text{Unexplained}}{\epsilon}} \]

  • This breaks the problem down into three components, and estimates two parameters:
    • \(\beta_0\) - The intercept, estimating the average value of \(Y\) when \(X = 0\).
    • \(\beta_1\) - The slope, estimating the effect that \(X\) has on the outcome, \(Y\).
    • \(\epsilon\) - The error term, capturing the remaining variance in the outcome \(Y\) that is not explained by the rest of the model.

Applying the Formula

  • The regression formula for predicting or explaining body mass from flipper length:

\[ \text{Body Mass} = \beta_0 + \beta_1 \times \text{Flipper Length} \]

  • Intercept (\(\beta_0\)) = -5872.1g
    • The average body mass when flipper length equals zero (not meaningful here, but necessary to anchor the line).
  • Slope (\(\beta_1\)) = 50.2g per mm
    • How much body mass changes for each 1mm increase in flipper length.
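
For example, plugging a flipper length of 200mm (a value picked purely for illustration) into the fitted line:

\[ \text{Body Mass} = -5872.1 + 50.2 \times 200 = 4167.9\text{g} \]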

What Makes Regression So Powerful?

Simplicity Combined with Flexibility

Regression Can Handle Multiple Variables

  • Unlike correlation (and most data visualisation), regression is not limited to pairs of variables.
  • Regression can include many predictors.
  • You can add as many variables to a regression model as you want.
    • Though you should only add those that matter (and even then less is often more).
  • We know that flipper length influences body mass, but perhaps bill length and/or bill depth also matter?
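
A sketch of what that looks like in practice, assuming Python with the statsmodels package and the palmerpenguins data package (both assumptions, not part of these slides):

```python
import statsmodels.formula.api as smf
from palmerpenguins import load_penguins  # assumed data package

penguins = load_penguins()

# One outcome, three predictors; the formula interface drops rows
# with missing values before fitting
model = smf.ols(
    "body_mass_g ~ flipper_length_mm + bill_length_mm + bill_depth_mm",
    data=penguins,
).fit()

print(model.params)  # one coefficient per predictor, plus the intercept
```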

From Linear to Generalised Linear Models

  • Linear regression assumes your outcome is continuous and that the errors around the line are normally distributed, but this is often not the case.
  • Generalised linear models extend this idea to other types of outcome using a “link function”.
  • The link function transforms your outcome so linear regression can work on it. It bends the scale so a straight line fits the data.
    • Logistic regression (logit link function) - Binary outcomes (survival vs death, electoral victory vs loss).
    • Poisson regression (log link function) - Count outcomes (number of attendances).
  • Generalised linear models use the same core idea, but with a transformation step before fitting the line.
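
A minimal sketch of a logistic regression, using a made-up binary outcome (whether a penguin weighs more than 4000g, chosen purely for illustration) and again assuming statsmodels and palmerpenguins:

```python
import statsmodels.formula.api as smf
from palmerpenguins import load_penguins  # assumed data package

penguins = load_penguins().dropna(subset=["body_mass_g", "flipper_length_mm"])

# Hypothetical binary outcome for illustration: is the penguin over 4kg?
penguins["heavy"] = (penguins["body_mass_g"] > 4000).astype(int)

# Logistic regression fits a straight line on the log-odds scale (logit link)
logit_model = smf.logit("heavy ~ flipper_length_mm", data=penguins).fit()
print(logit_model.params)
```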

Dealing with Grouping Structures

  • Linear regression assumes all observations are independent.
  • Sometimes (often) there are higher-level grouping structures that moderate the effect of \(X\) on \(Y\).
    • Students in different classes (or schools).
    • Patients attending different hospitals.
    • Different species of penguin.
  • Multilevel models account for grouping structure by fitting lines to each group without treating each group as completely distinct.
    • Groups can share information while still differing.
  • Same linear framework, just allowing for hierarchy and grouping structure.
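
A sketch of a random-intercept multilevel model, letting each species have its own intercept while sharing information across species (same assumed packages as above):

```python
import statsmodels.formula.api as smf
from palmerpenguins import load_penguins  # assumed data package

penguins = load_penguins().dropna(subset=["body_mass_g", "flipper_length_mm"])

# Each species gets its own intercept, but the intercepts are drawn from
# a shared distribution rather than estimated completely independently
mlm = smf.mixedlm(
    "body_mass_g ~ flipper_length_mm",
    data=penguins,
    groups=penguins["species"],
).fit()

print(mlm.summary())
```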

The Possibilities are Endless

  • These are just three examples of the ways that the simple linear model can be adapted to fit different needs.
  • The real power of linear regression is that it combines simplicity with flexibility.
  • It works, and it can work in so many different situations.

Wrapping Up

A Final Sales Pitch for Linear Regression

Key Takeaways

  • If things vary together, we can measure that and use it to understand the relationship between them.
  • Regression quantifies those relationships. It asks not just whether variables are related, but by how much.
  • Regression is just fitting lines to data.
    • Finding the line that minimises the error gives us the line of best fit, which describes how the variables are related in the data.
    • This allows us to make predictions about unseen data and inferences about the relationship between variables.
  • The linear model is incredibly powerful, but also incredibly flexible.
    • Once you’ve figured out linear regression, you have an entire toolbox at your disposal.

Further Learning

Thank You!

Contact:

Code & Slides:

/NHS-South-Central-and-West/code-club

… And don’t forget to give us your feedback.