Building a Simple Linear Model to Better Understand Regression Methods
Get in Loser, We’re Fitting Lines to Data
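The slides fit everything to a data frame df of students' exam scores and hours studied, which isn't shown. A minimal sketch that simulates comparable data: the column names come from the slides, but the intercept, slope, and noise level are assumptions chosen to roughly match the fitted values in the table below.

library(tidyverse)
library(gt)

# Hypothetical data: column names match the slides; the coefficients
# (intercept ~61, slope ~0.3) and noise level are assumed, not the
# original values.
set.seed(42)
n <- 300
df <- tibble(
  hours_studied = round(runif(n, min = 0, max = 30)),
  exam_score = round(61 + 0.3 * hours_studied + rnorm(n, sd = 3))
)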
model <- lm(exam_score ~ hours_studied, data = df)
df <-
  df |>
  mutate(
    fitted = model$fitted.values,
    residual = model$residuals
  )
df |>
  select(hours_studied, exam_score, fitted, residual) |>
  janitor::clean_names(case = "title") |>
  slice_sample(n = 10) |>
  gt() |>
  fmt_number(columns = c(Fitted, Residual), decimals = 2) |>
  cols_align(align = "center", columns = everything())
| Hours Studied | Exam Score | Fitted | Residual |
|:---:|:---:|:---:|:---:|
| 12 | 59 | 64.74 | −5.74 |
| 23 | 65 | 67.95 | −2.95 |
| 16 | 63 | 65.91 | −2.91 |
| 18 | 71 | 66.49 | 4.51 |
| 21 | 67 | 67.37 | −0.37 |
| 16 | 63 | 65.91 | −2.91 |
| 19 | 69 | 66.79 | 2.21 |
| 2 | 66 | 61.82 | 4.18 |
| 30 | 67 | 70.00 | −3.00 |
| 26 | 68 | 68.83 | −0.83 |
coefs <- model$coefficients
df <-
  df |>
  mutate(
    fitted_under = coefs[1] + (coefs[2] - 0.1) * hours_studied,
    residual_under = exam_score - fitted_under
  )
df |>
  ggplot(aes(x = hours_studied, y = exam_score)) +
  geom_point(shape = 21, fill = "white", size = 1.5, stroke = 1) +
  geom_line(
    aes(x = hours_studied, y = fitted_under),
    linewidth = 1, colour = "#005EB8"
  ) +
  labs(x = "Hours Studied", y = "Exam Score")
df <-
  df |>
  mutate(
    fitted_over = coefs[1] + (coefs[2] + 0.2) * hours_studied,
    residual_over = exam_score - fitted_over
  )
df |>
  ggplot(aes(x = hours_studied, y = exam_score)) +
  geom_point(shape = 21, fill = "white", size = 1.5, stroke = 1) +
  geom_line(
    aes(x = hours_studied, y = fitted_over),
    linewidth = 1, colour = "#005EB8"
  ) +
  labs(x = "Hours Studied", y = "Exam Score")
set.seed(42)
df |>
  slice_sample(n = 50) |>
  ggplot(aes(x = hours_studied, y = exam_score)) +
  geom_segment(
    aes(
      x = hours_studied, y = fitted_under,
      xend = hours_studied, yend = fitted_under + residual_under
    ),
    linewidth = 1, colour = "#ED8B00"
  ) +
  geom_point(shape = 21, fill = "white", size = 1.5, stroke = 1) +
  geom_line(
    aes(x = hours_studied, y = fitted_under),
    linewidth = 1, colour = "#005EB8"
  ) +
  labs(x = "Hours Studied", y = "Exam Score")
set.seed(42)
df |>
  slice_sample(n = 50) |>
  ggplot(aes(x = hours_studied, y = exam_score)) +
  geom_segment(
    aes(
      x = hours_studied, y = fitted_over,
      xend = hours_studied, yend = fitted_over + residual_over
    ),
    linewidth = 1, colour = "#ED8B00"
  ) +
  geom_point(shape = 21, fill = "white", size = 1.5, stroke = 1) +
  geom_line(
    aes(x = hours_studied, y = fitted_over),
    linewidth = 1, colour = "#005EB8"
  ) +
  labs(x = "Hours Studied", y = "Exam Score")
set.seed(42)
df |>
  slice_sample(n = 50) |>
  ggplot(aes(x = hours_studied, y = exam_score)) +
  geom_segment(
    aes(
      x = hours_studied, y = fitted,
      xend = hours_studied, yend = fitted + residual
    ),
    linewidth = 1, colour = "#ED8B00"
  ) +
  geom_point(shape = 21, fill = "white", size = 1.5, stroke = 1) +
  geom_line(
    aes(x = hours_studied, y = fitted),
    linewidth = 1, colour = "#005EB8"
  ) +
  labs(x = "Hours Studied", y = "Exam Score")
df |>
  ggplot(aes(x = hours_studied, y = exam_score)) +
  geom_segment(
    aes(
      x = hours_studied, y = fitted,
      xend = hours_studied, yend = fitted + residual
    ),
    linewidth = 1, colour = "#ED8B00"
  ) +
  geom_point(shape = 21, fill = "white", size = 1.5, stroke = 1) +
  geom_line(
    aes(x = hours_studied, y = fitted),
    linewidth = 1, colour = "#005EB8"
  ) +
  labs(x = "Hours Studied", y = "Exam Score")
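Summing the squared residuals confirms what the plots suggest: the OLS line leaves less unexplained than either hand-tweaked alternative. A quick check using the residual columns computed above:

# Residual sum of squares for each candidate line;
# the OLS fit should have the smallest value of the three.
df |>
  summarise(
    rss_under = sum(residual_under^2),
    rss_over = sum(residual_over^2),
    rss_ols = sum(residual^2)
  )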
\[\min_{\hat\beta_0,\hat\beta_1} \sum_{i=1}^n \left( y_i - (\hat\beta_0 + \hat\beta_1 x_i) \right)^2\]
Solving for \(\hat\beta_0\):
Differentiate: \(\frac{\partial}{\partial \hat{\beta}_0} \text{RSS} = -2 \sum_{i=1}^n \left( y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_i) \right)\)
Set to zero and solve: \(\hat\beta_0 = \bar{y} - \hat\beta_1 \bar{x}\)
Solving for \(\hat\beta_1\):
Differentiate: \(\frac{\partial}{\partial \hat{\beta}_1} \text{RSS} = -2 \sum_{i=1}^n x_i \left( y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_i) \right)\)
Set to zero and solve, substituting \(\hat\beta_0 = \bar{y} - \hat\beta_1 \bar{x}\): \(\hat{\beta}_1 = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n (x_i - \bar{x})^2}\)
Our good friends \(\beta_0\), \(\beta_1\), and \(\epsilon\).
\[ Y = \underbrace{\vphantom{\beta_0} \overset{\color{#41B6E6}{\text{Intercept}}}{\color{#41B6E6}{\beta_0}} + \overset{\color{#005EB8}{\text{Slope}}}{\color{#005EB8}{\beta_1}}X \space \space}_{\text{Explained Variance}} + \overset{\mathstrut \color{#ED8B00}{\text{Error}}}{\underset{\text{Unexplained}}{\color{#ED8B00}{\epsilon}}} \]
\[\hat\beta_1 = \frac{\text{Cov}(X, Y)}{\text{Var}(X)} = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n (x_i - \bar{x})^2} \]
\[\hat\beta_0 = \bar{y} - \hat\beta_1 \bar{x} \]
\[\hat{y}_i = \hat\beta_0 + \hat\beta_1 x_i \]
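These closed-form estimates are easy to verify in R; a minimal sketch computing them "from scratch" and checking against lm(), using the df and model objects from earlier:

# Slope from the covariance/variance formula; intercept from the means.
# cov() and var() both use the n - 1 denominator, so the ratio is the
# OLS slope exactly.
beta_1 <- cov(df$hours_studied, df$exam_score) / var(df$hours_studied)
beta_0 <- mean(df$exam_score) - beta_1 * mean(df$hours_studied)
c(beta_0, beta_1)
coef(model) # should match to floating-point precision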
What Happens When We Add More Predictors?
\[Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p + \epsilon\]
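With more predictors the same least-squares logic applies; in matrix form the solution is \(\hat{\beta} = (X^\top X)^{-1} X^\top y\). A sketch solving the normal equations directly for the single-predictor model above (additional predictors would simply add columns to X):

# Design matrix with an intercept column, then solve the normal equations.
X <- cbind(1, df$hours_studied)
y <- df$exam_score
beta_hat <- solve(t(X) %*% X, t(X) %*% y)
beta_hat # matches coef(model); extra predictors just add columns to X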
Where Next, Magic Math Man?
Contact:
Code & Slides:
Paul Johnson // Linear Regression from Scratch // Dec 18, 2024