The Fundamentals of Regression

Get in Loser, We’re Fitting Lines to Data

What is Regression?

  • Regression models predict an outcome variable from a set of predictors. This can range from straight lines fit through the data to many non-linear generalisations (Gelman, Hill, and Vehtari 2021).
  • Regression can be used to describe the association between variables, estimate the effect that a treatment has on an outcome, or predict how an outcome will change in response to changes in the predictor variables.
  • It is the foundation for many advanced statistical methods, and it is one of the most important tools available to an analyst seeking to use a sample to make inferences or predictions about the population (Rowntree 2018).
  • While powerful, the method itself is (relatively) simple, when boiled down to its component parts.

It’s All About the (Co)Variance

  • It is possible to make predictions or estimate effects between quantities of interest by analysing how they vary, both independently and together.
  • The extent to which the predictor variable, \(X\), and the outcome, \(Y\), move together is called “covariance”.
    • A positive covariance indicates that the values of \(X\) and \(Y\) tend to increase and decrease together, while a negative covariance indicates that they tend to move in opposite directions.
  • Regression fits a line through the data that estimates how the outcome variable changes when the value of the predictor variable(s) changes. Finding the line that does the most effective job of passing through the data helps describe the relationship between the variables, or predict the outcome from the predictors.
  • This doesn’t sound like much, but it’s a surprisingly powerful idea; the short sketch after this list shows the raw ingredients in code.
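A minimal sketch of these quantities, assuming the df data frame (with the hours_studied and exam_score columns) used in the chunks that follow:

# How the predictor and outcome vary, separately and together
var(df$hours_studied)                  # variance of the predictor
var(df$exam_score)                     # variance of the outcome
cov(df$hours_studied, df$exam_score)   # covariance: positive if they tend to rise together

# The ratio of covariance to predictor variance is the OLS slope (derived later)
cov(df$hours_studied, df$exam_score) / var(df$hours_studied)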

Fitting a Line Through Data

  • At its most basic conceptual level, regression is just finding the line of best fit through the data.
    • The complexity comes in the need to use precise, unbiased methods to find the “best” fit.
  • There are various “estimators” that can be used to fit a straight line to data, but the most common method is called ordinary least squares (OLS).
  • OLS finds the line of best fit by minimising the total squared distance between the line and the observed values. The distance between the line, which represents the model’s fitted values (or the predicted value of \(Y\) given \(X\)), and each observed value in the data is known as the residual.
  • All paths lead to minimising the residuals.

Visualising the Outcome Variance

df |> 
  ggplot(aes(x = exam_score)) +
  geom_histogram(binwidth = 1) +
  labs(x = "Exam Score", y = NULL)

Visualising the Outcome Variance

# Restrict the data to exam scores below 80
df <- df |> filter(exam_score < 80)
  
df |> 
  ggplot(aes(x = exam_score)) +
  geom_histogram(binwidth = 1) +
  labs(x = "Exam Score", y = NULL)

Visualising the Predictor Variance

df |> 
  ggplot(aes(x = hours_studied)) +
  geom_histogram(binwidth = 1) +
  labs(x = "Hours Studied", y = NULL)

Visualising their Covariance

df |> 
  ggplot(aes(x = hours_studied, y = exam_score)) +
  geom_point(shape = 21, fill = "white", size = 1.5, stroke = 1) +
  labs(x = "Hours Studied", y = "Exam Score")

Fitting a Regression

# Fit a simple linear regression of exam score on hours studied
model <- lm(exam_score ~ hours_studied, data = df)

# Store each observation's fitted value and residual alongside the data
df <-
  df |> 
  mutate(
    fitted = model$fitted.values, 
    residual = model$residuals
    )

df |> 
  select(hours_studied, exam_score, fitted, residual) |> 
  janitor::clean_names(case = "title") |> 
  slice_sample(n = 10) |> 
  gt() |> 
  fmt_number(columns = c(Fitted, Residual), decimals = 2) |> 
  cols_align(align = "center", columns = everything())
Hours Studied   Exam Score   Fitted   Residual
     12             59        64.74      −5.74
     23             65        67.95      −2.95
     16             63        65.91      −2.91
     18             71        66.49       4.51
     21             67        67.37      −0.37
     16             63        65.91      −2.91
     19             69        66.79       2.21
      2             66        61.82       4.18
     30             67        70.00      −3.00
     26             68        68.83      −0.83

Finding the Line of “Best Fit”

# Extract the estimated intercept and slope
coefs <- model$coefficients

# A comparison line with a deliberately shallower slope than the OLS estimate
df <- 
  df |> 
  mutate(
    fitted_under = coefs[1] + (coefs[2] - 0.1) * hours_studied,
    residual_under = exam_score - fitted_under
    )

df |> 
  ggplot(aes(x = hours_studied, y = exam_score)) +
  geom_point(shape = 21, fill = "white", size = 1.5, stroke = 1) +
  geom_line(
    aes(x = hours_studied, y = fitted_under), 
    linewidth = 1, colour = "#005EB8"
  ) +
  labs(
    x = "Hours Studied", y = "Exam Score"
  )

Finding the Line of “Best Fit”

# A comparison line with a deliberately steeper slope than the OLS estimate
df <- 
  df |> 
  mutate(
    fitted_over = coefs[1] + (coefs[2] + 0.2) * hours_studied,
    residual_over = exam_score - fitted_over
    )


df |> 
  ggplot(aes(x = hours_studied, y = exam_score)) +
  geom_point(shape = 21, fill = "white", size = 1.5, stroke = 1) +
  geom_line(
    aes(x = hours_studied, y = fitted_over), 
    linewidth = 1, colour = "#005EB8"
  ) +
  labs(
    x = "Hours Studied", y = "Exam Score"
  )

Finding the Line of “Best Fit”

df |> 
  ggplot(aes(x = hours_studied, y = exam_score)) +
  geom_point(shape = 21, fill = "white", size = 1.5, stroke = 1) +
  geom_smooth(
    method = lm, se = FALSE, linewidth = 1, colour = "#005EB8"
    ) +
  labs(
    x = "Hours Studied", y = "Exam Score"
    )

Minimising the Residuals

  • The least squares estimator takes the residual value of all of the regression’s predictions, squares them, and then sums them to get a single value that captures how well the model fits to the data.
    • Squaring the value of each residual ensures that all values are positive, so that positive and negative residuals don’t cancel each other out, and gives greater weight to larger residuals, so that the fitted line accounts for values that fall far from it. The resulting total is known as the residual sum of squares (RSS).
    • \(\text{RSS} = \sum_{i=1}^n \left( y_i - (\beta_0 + \beta_1 x_i) \right)^2\)
  • Finding the line that minimises the RSS produces the best linear unbiased estimator (assuming that the regression assumptions are all met), or the line of best fit! The sketch after this list compares the RSS of the three candidate lines plotted above.
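A quick numerical check, using the residual columns created in the earlier chunks (residual_under and residual_over for the two deliberately mis-sloped lines, residual for the OLS fit):

# Residual sum of squares for each candidate line
sum(df$residual_under^2)   # slope too shallow
sum(df$residual_over^2)    # slope too steep
sum(df$residual^2)         # OLS fit: the smallest RSS of the three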

Minimising the Residuals

set.seed(42)


df |> 
  slice_sample(n = 50) |> 
  ggplot(aes(x = hours_studied, y = exam_score)) +
  geom_segment(
    aes(
      x = hours_studied, y = fitted_under, 
      xend = hours_studied, yend = fitted_under + residual_under
      ), 
    linewidth = 1, color = "#ED8B00"
    ) +
  geom_point(shape = 21, fill = "white", size = 1.5, stroke = 1) +
  geom_line(
    aes(x = hours_studied, y = fitted_under), 
    linewidth = 1, colour = "#005EB8"
  ) +
  labs(x = "Hours Studied", y = "Exam Score")

Minimising the Residuals

set.seed(42)

df |> 
  slice_sample(n = 50) |> 
  ggplot(aes(x = hours_studied, y = exam_score)) +
  geom_segment(
    aes(
      x = hours_studied, y = fitted_over, 
      xend = hours_studied, yend = fitted_over + residual_over
      ), 
    linewidth = 1, color = "#ED8B00"
    ) +
  geom_point(shape = 21, fill = "white", size = 1.5, stroke = 1) +
  geom_line(
    aes(x = hours_studied, y = fitted_over), 
    linewidth = 1, colour = "#005EB8"
  ) +
  labs(x = "Hours Studied", y = "Exam Score")

Minimising the Residuals

set.seed(42)

df |>
  slice_sample(n = 50) |> 
  ggplot(aes(x = hours_studied, y = exam_score)) +
  geom_segment(
    aes(
      x = hours_studied, y = fitted, 
      xend = hours_studied, yend = fitted + residual
      ), 
    linewidth = 1, color = "#ED8B00"
    ) +
  geom_point(shape = 21, fill = "white", size = 1.5, stroke = 1) +
  geom_line(
    aes(x = hours_studied, y = fitted), 
    linewidth = 1, colour = "#005EB8"
    ) +
  labs(x = "Hours Studied", y = "Exam Score")

Minimising the Residuals

df |>
  ggplot(aes(x = hours_studied, y = exam_score)) +
  geom_segment(
    aes(
      x = hours_studied, y = fitted, 
      xend = hours_studied, yend = fitted + residual
      ), 
    linewidth = 1, color = "#ED8B00"
    ) +
  geom_point(shape = 21, fill = "white", size = 1.5, stroke = 1) +
  geom_line(
    aes(x = hours_studied, y = fitted), 
    linewidth = 1, colour = "#005EB8"
    ) +
  labs(x = "Hours Studied", y = "Exam Score")

Solve for \(\beta_0\) and \(\beta_1\); Minimise RSS

\[\min_{\hat\beta_0,\hat\beta_1} \sum_{i=1}^n \left( y_i - (\hat\beta_0 + \hat\beta_1 x_i) \right)^2\]

  • Minimising RSS gives us the line that best fits the data, but we don’t know what \(\beta_0\) or \(\beta_1\) are!
  • Solving for \(\beta_0\):

  • Differentiate and set to zero: \(\frac{\partial}{\partial \hat{\beta}_0} \text{RSS} = -2 \sum_{i=1}^n \left( y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_i) \right) = 0\)

  • Solve: \(\hat\beta_0 = \bar{y} - \hat\beta_1 \bar{x}\)

  • Solving for \(\beta_1\):

  • Differentiate and set to zero: \(\frac{\partial}{\partial \hat{\beta}_1} \text{RSS} = -2 \sum_{i=1}^n x_i \left( y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_i) \right) = 0\)

  • Solve: \(\hat{\beta}_1 = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n (x_i - \bar{x})^2}\)

  • Final coefficients: \(\hat{\beta}_1 = \frac{\text{Cov}(X, Y)}{\text{Var}(X)}, \quad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}\) (checked numerically in the sketch below)
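These closed-form solutions can be checked directly against lm(), using the df and model objects created earlier:

# Closed-form OLS estimates from the sample covariance and variance
x <- df$hours_studied
y <- df$exam_score

beta_1 <- cov(x, y) / var(x)           # slope: Cov(X, Y) / Var(X)
beta_0 <- mean(y) - beta_1 * mean(x)   # intercept: ybar - beta_1 * xbar

c(beta_0, beta_1)
coef(model)   # matches the manual calculation (up to floating-point error)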

The Components of Simple Linear Regression

Our good friends \(\beta_1\), \(\beta_0\), and \(\epsilon\).

The Most Perfect Little Equation

  • The formula for a simple linear regression model, predicting \(Y\) with one predictor \(X\):

\[ Y = \underbrace{\vphantom{\beta_0} \overset{\color{#41B6E6}{\text{Intercept}}}{\color{#41B6E6}{\beta_0}} + \overset{\color{#005EB8}{\text{Slope}}}{\color{#005EB8}{\beta_1}}X \space \space}_{\text{Explained Variance}} + \overset{\mathstrut \color{#ED8B00}{\text{Error}}}{\underset{\text{Unexplained}}{\color{#ED8B00}{\epsilon}}} \]

  • This breaks the problem down into three components, and estimates two parameters:
    • \(\beta_1\) - The slope, estimating the effect that \(X\) has on the outcome, \(Y\).
    • \(\beta_0\) - The intercept, estimating the average value of \(Y\) when \(X = 0\).
    • \(\epsilon\) - The error term, capturing the remaining variance in the outcome \(Y\) that is not explained by the rest of the model. (Each component is pulled from a fitted model in the sketch after this list.)
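Each of these components can be read off the model object fitted earlier in the slides:

# The estimated parameters and the unexplained part
coef(model)[["(Intercept)"]]     # beta_0: expected exam score when hours_studied = 0
coef(model)[["hours_studied"]]   # beta_1: expected change in score per extra hour studied
head(residuals(model))           # estimates of epsilon: what the line leaves unexplained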

Calculating the Regression Slope (\(\beta_1\))

\[\beta_1 = \frac{\text{Cov}(X, Y)}{\text{Var}(X)} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2} \]

Calculating the Regression Intercept (\(\beta_0\))

\[\beta_0 = \bar{y} - \beta_1 \bar{x} \]

Predicting the Outcome (\(\hat{y}\))

\[\hat{y}_i = \beta_0 + \beta_1 x_i \]
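In R, predict() applies this equation to new predictor values (a small sketch, using the model fitted earlier):

# Predicted exam score for a student who studied 20 hours
predict(model, newdata = data.frame(hours_studied = 20))

# Equivalent to plugging x = 20 into the equation by hand
coef(model)[1] + coef(model)[2] * 20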

Let’s Write Some Code!

From Simple to Multiple Regression

What Happens When We Add More Predictors?

It’s Just a Regression With More Variables!

\[Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + ... + \beta_n X_n + \epsilon\]
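Fitting this in R only means adding terms to the formula. The second predictor below (hours_slept) is hypothetical, used purely for illustration; it is not a column in the example data from these slides:

# Multiple regression: same lm() call, more predictors
# (hours_slept is a hypothetical column, for illustration only)
multi_model <- lm(exam_score ~ hours_studied + hours_slept, data = df)
coef(multi_model)

Each slope is now the estimated association between that predictor and the outcome, holding the other predictors constant.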

Wrapping Up

Where Next, Magic Math Man?

What We Haven’t Covered

  • Model interpretation - Reading regression tables, understanding coefficients
  • Model evaluation - \(R^2\), Root Mean Squared Error (RMSE)
  • Assumptions of linear regression - Linearity, independence, homoscedasticity, no multicollinearity
  • Model diagnostics

The Magic Was Within You All Along

  • At its heart, regression is relatively simple: it is just finding the line of best fit through your data.
  • The component parts of a regression model fit together to estimate the explained variance (the intercept and slope), and the unexplained variance (the error term). This allows us to make predictions and quantify the uncertainty of those predictions.
  • Linear regression is not magic (even if it often feels that way).
    • The magic comes from good theory, good data, and attention to detail (but mostly good theory).

Further Resources

Thank You!

Contact:

Code & Slides:

References

Gelman, Andrew, Jennifer Hill, and Aki Vehtari. 2021. Regression and Other Stories. Cambridge University Press.
Rowntree, Derek. 2018. Statistics Without Tears: An Introduction for Non-Mathematicians. Penguin Books.