Introduction to Multilevel Regression

Moving Beyond Single-Level Models & Dealing With Clustering in Data

Paul Johnson

What is Multilevel Data?

Understanding Grouping Structures in Data

Multilevel Data is Everywhere

Typical regression methods assume a flat single-level data structure, which does not account for the complexity that we often find in real-world data.
- This means that single-level regression methods assume that all observations are independent of each other.
It is very common to find that there are “grouping structures” in your data. This is multilevel data!
Multilevel data is so common that our starting assumption should be that any data has grouping structures that need to be accounted for (McElreath 2017).
Whenever you have multiple measurements per group, across many grouping units, you have multilevel data (Thieu 2020)!

Many Sources of Clustering

Multilevel data can take lots of different forms, some more obvious than others.
The most intuitive (and probably most common) type of grouping structure is hierarchical.
- Observations are clustered based on some higher-level grouping. Some common examples include countries, cities, and school classes.
Not all grouping structures are hierarchical. Data can also cluster at the observation-level.
- Characteristics like gender, ethnicity, or social class can lead to clustering.
- Another very common form of observation-level grouping is “repeated measurements”.
When observations within a group are more similar to each other than to observations in other groups, you have a multilevel problem.
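As a rough illustration of that last point, the sketch below simulates grouped data in base R and compares the spread within each group to the spread across the whole dataset. The group labels, effect sizes, and sample sizes are invented for the example and are not taken from the avocado data.

```r
set.seed(1)

n_groups <- 5
n_per_group <- 30

# Hypothetical grouping variable: five groups labelled A to E.
group <- rep(LETTERS[1:n_groups], each = n_per_group)

# Each group gets its own shift away from the overall mean of 10.
group_effect <- rnorm(n_groups, sd = 2)
names(group_effect) <- LETTERS[1:n_groups]

# Outcome = overall mean + group shift + observation-level noise.
y <- 10 + unname(group_effect[group]) + rnorm(length(group), sd = 1)

overall_sd <- sd(y)               # spread when grouping is ignored
within_sd <- tapply(y, group, sd) # spread inside each group

overall_sd
within_sd
```

If the within-group standard deviations are clearly smaller than the overall standard deviation, a large share of the variation sits between groups rather than within them.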

Population-Level Grouping Structures

avocados |> 
  ggplot(aes(log(total_volume), fill = type)) +
  geom_histogram(position = "identity", alpha = 0.5) +
  labs(x = "Total Volume (log)", y = NULL) +
  scwplot::scale_fill_qualitative(palette = "scw")

Hierarchical Grouping Structures

avocados |> 
  filter(region %in% c("Houston", "Seattle", "Syracuse")) |> 
  ggplot(aes(log(total_volume), fill = type)) +
  geom_histogram(position = "identity", alpha = 0.5) +
  facet_wrap(~ region, nrow = 3) +
  labs(x = "Total Volume (log)", y = NULL) +
  scwplot::scale_fill_qualitative(palette = "scw")

Hierarchical Grouping Structures

avocados |> 
  filter(region %in% c("Houston", "Seattle", "Syracuse")) |> 
  ggplot(aes(average_price, log(total_volume), colour = region)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = lm, se = FALSE) +
  facet_wrap(~ type) +
  labs(x = "Average Price", y = "Total Volume (log)") +
  scwplot::scale_colour_qualitative(palette = "scw")

Multilevel Problems

Issues with Using Single-Level Models for Multilevel Problems

Why Does it Matter?

When you have multilevel data, it is necessary to think about variance within groups and between groups.
- Observations within groups are often more alike than observations between groups.
Single-level regression models do not account for the complexity of multilevel data, either ignoring it entirely (complete pooling) or treating groups as independent (no pooling).
Failure to account for grouping structures in data can violate assumptions and bias estimates.
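The two single-level options can be sketched in base R as follows. This is a minimal example on simulated data, assuming three hypothetical groups of very different sizes; none of the names or numbers come from the avocado data.

```r
set.seed(2)

# Two large groups and one very small group (hypothetical).
group <- rep(c("a", "b", "c"), times = c(50, 50, 3))
true_mean <- c(a = 0, b = 2, c = 4)
y <- unname(true_mean[group]) + rnorm(length(group))
sim <- data.frame(y = y, group = group)

# Complete pooling: one intercept for everyone, groups are ignored.
complete_pooling <- lm(y ~ 1, data = sim)

# No pooling: a separate intercept per group, estimated independently.
no_pooling <- lm(y ~ 0 + group, data = sim)

coef(complete_pooling)
coef(no_pooling)
```

Complete pooling returns a single estimate that fits none of the groups well, while no pooling estimates group `c` from only three observations, so its estimate is much noisier than the other two.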

Assumption of Independence

The independence assumption is one of the four assumptions of linear regression.
It assumes that your residuals are all independent: observations (and their residuals) should not be correlated with each other.
When there is clustering among observations, this violates the independence assumption.
This is the worst assumption to violate!
- Understates standard errors, which makes p-values too small and confidence intervals too narrow.
- Creates the appearance of certainty where it does not exist.
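A small simulation can make this false certainty visible. The sketch below is an illustration under assumed settings (20 groups of 20 observations, a group-level predictor with no true effect, and equal group and residual variance), not an analysis from the slides. It fits a single-level `lm()` to clustered data many times and records how often the predictor looks "significant".

```r
set.seed(42)

n_groups <- 20
n_per_group <- 20
n_sims <- 200

p_values <- replicate(n_sims, {
  group <- rep(seq_len(n_groups), each = n_per_group)
  x <- rnorm(n_groups)[group]            # group-level predictor, true effect is 0
  u <- rnorm(n_groups)[group]            # shared group effect creates clustering
  y <- u + rnorm(n_groups * n_per_group) # outcome does not depend on x
  summary(lm(y ~ x))$coefficients["x", "Pr(>|t|)"]
})

# With independent observations this would be close to 0.05.
false_positive_rate <- mean(p_values < 0.05)
false_positive_rate
```

Because the model treats 400 correlated observations as 400 independent pieces of evidence, the share of false positives should land far above the nominal 5%.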

Simpson’s Paradox

datasauRus::simpsons_paradox |> 
  ggplot(aes(x = x, y = y, colour = dataset)) +
  geom_point() +
  geom_smooth(method = lm, se = FALSE) +
  facet_wrap(~ dataset, nrow = 2) +
  scwplot::scale_colour_qualitative(palette = "scw")
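The same reversal can be reproduced with a few lines of base R. This is a sketch on simulated data, assuming three hypothetical groups whose averages rise together while the relationship inside each group is negative. It does not use the `datasauRus` data plotted above.

```r
set.seed(3)

n_per_group <- 50
group <- rep(1:3, each = n_per_group)

# Group centres move up and to the right: 0, 5, 10.
centre <- 5 * (group - 1)
x <- centre + rnorm(length(group))

# Between groups y rises with x, but within a group y falls as x rises.
y <- 2 * centre - (x - centre) + rnorm(length(group), sd = 0.5)
sim <- data.frame(x = x, y = y, group = factor(group))

# Ignoring groups: slope across the pooled data.
pooled_slope <- coef(lm(y ~ x, data = sim))[["x"]]

# Accounting for groups: common slope within groups.
within_slope <- coef(lm(y ~ x + group, data = sim))[["x"]]

c(pooled = pooled_slope, within = within_slope)
```

The pooled slope comes out positive and the within-group slope negative, so the conclusion flips depending on whether the grouping structure is modelled.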

Multilevel Regression

Multilevel Solutions for Multilevel Problems!

What is Multilevel Regression?

Single-level models treat each cluster as unrelated to the others, forgetting everything they have learned from one cluster when moving on to the next. If the clusters are of the same type, this leaves valuable information on the table (McElreath 2017).
- Multilevel models (MLMs) remember features of each cluster in the data as they learn about all of the clusters (McElreath 2018).
- Pooling information across groups/clusters helps improve estimates about each group/cluster.
MLMs allow us to fit regression models to individual-level data while accounting for systematic unexplained variation among groups (Gelman 2006).
MLMs are appropriate when you care about group differences, when you have many groups, and when there are imbalances among groups.
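One way to see how pooling improves group estimates is a standard approximation for a group mean under partial pooling: a precision-weighted average of the group's own mean and the overall mean. The function below is a hypothetical helper written for this illustration (its name and arguments are not from any package). It assumes an intercept-only model with known residual standard deviation `sigma_y` and between-group standard deviation `sigma_alpha`.

```r
# Approximate partially pooled estimate of a group mean.
partial_pool <- function(group_mean, n_group, overall_mean, sigma_y, sigma_alpha) {
  w_group <- n_group / sigma_y^2   # precision of the group's own mean
  w_overall <- 1 / sigma_alpha^2   # precision of the between-group distribution
  (w_group * group_mean + w_overall * overall_mean) / (w_group + w_overall)
}

# A group with only 3 observations is pulled strongly towards the overall mean.
small_group <- partial_pool(group_mean = 4, n_group = 3, overall_mean = 1,
                            sigma_y = 1, sigma_alpha = 1)

# A group with 50 observations keeps an estimate close to its own mean.
large_group <- partial_pool(group_mean = 4, n_group = 50, overall_mean = 1,
                            sigma_y = 1, sigma_alpha = 1)

c(small = small_group, large = large_group)
```

Here the small group's estimate is 3.25 and the large group's is about 3.94, which shows why imbalanced groups benefit most: groups with little data borrow more information from the rest.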

Multilevel Regression Types

MLMs are very flexible, and can be fit to many different types of multilevel data.
They can be fit to data with many different levels, various kinds of grouping structures, and all kinds of relationships between variables.
There are three main types of MLMs: varying intercepts, varying slopes, and varying intercepts and slopes.
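In `lme4` formula syntax, the three types differ only in the term inside the parentheses. The formulas below are a sketch using the avocado column names that appear elsewhere in these slides. Only the varying intercepts model is actually fitted in the deck, so the other two are illustrative assumptions.

```r
# Varying intercepts: each region gets its own intercept, one shared slope.
varying_intercepts_formula <-
  log(total_volume) ~ average_price + (1 | region)

# Varying slopes: one shared intercept, each region gets its own price slope.
varying_slopes_formula <-
  log(total_volume) ~ average_price + (0 + average_price | region)

# Varying intercepts and slopes: both are allowed to differ by region.
varying_both_formula <-
  log(total_volume) ~ average_price + (1 + average_price | region)

# Formulas are plain R objects, so they can be inspected before fitting.
all.vars(varying_both_formula)
```

Each formula could then be passed to `lmer()` together with `data = avocados`.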

Estimating Group Effects

We can build a “null model” to examine how much of the variance in the data can be explained by the grouping structure.

lmer(total_volume ~ 1 + (1 | region), data = avocados) |> 
  performance::icc() |> 
  janitor::clean_names(replace = c("ICC" = ""), "title") |>
  tidyr::pivot_longer(everything(), names_to = "ICC", values_to = "Value") |>
  gt() |>
  tab_header("Intraclass Correlation Coefficient (ICC)") |>
  fmt_number(columns = where(is.numeric), decimals = 2) |>
  tab_options(table.width = pct(100))

Intraclass Correlation Coefficient (ICC)
ICC	Value
Adjusted	0.38
Conditional	0.38
Unadjusted	0.38
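For a null model with a single grouping factor, the ICC is the share of total variance that sits between groups. The symbols below are notation chosen for this note rather than output from the model: the numerator is the variance of the region intercepts and the denominator adds the residual variance.

```latex
\mathrm{ICC} = \frac{\sigma^2_{\text{region}}}{\sigma^2_{\text{region}} + \sigma^2_{\text{residual}}}
```

Read this way, an ICC of 0.38 suggests that roughly 38% of the variance in total volume is attributable to differences between regions. With no predictors in the model, the adjusted and conditional values coincide, which matches the table.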

Varying Intercepts

Varying intercepts are the most common type of multilevel model (and the easiest to understand).
Instead of pooling all information about groups and fitting a single line to the data, a varying intercepts model fits a line to each group, with every line sharing the same slope.
We estimate one model but let each group have its own intercept (its own predicted value when the predictors are 0).
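Written out, a varying intercepts model with one predictor can be sketched as below. The notation is an assumption made for this note: observation i in group j has outcome y and predictor x, the slope is shared, and the group intercepts are drawn from a common normal distribution.

```latex
\begin{aligned}
y_{ij} &= \alpha_{j} + \beta x_{ij} + \varepsilon_{ij} \\
\alpha_{j} &\sim \mathcal{N}(\mu_{\alpha}, \sigma^2_{\alpha}) \\
\varepsilon_{ij} &\sim \mathcal{N}(0, \sigma^2_{y})
\end{aligned}
```

The second line is what separates this from a no-pooling model: because the intercepts share a distribution, each group's estimate is informed by all the others.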

Varying Intercepts

varying_intercepts <-       
  lmer(
    log(total_volume) ~ average_price + organic + (1 | region), 
    data = avocados
    )

modelsummary::msummary(
  list("Total Volume" = varying_intercepts),
  statistic = 'conf.int', coef_map = cm, gof_omit = "AIC|BIC|R2", 
  fmt = 2, exponentiate = TRUE, output = "gt",
  title = "Multilevel Regression of Avocado Sales"
  ) |>
  tab_row_group(md("**Group Effects**"), rows = 7:8) |>
  tab_row_group(md("**Population Effects**"), rows = 1:6) |>
  tab_style(
    style = cell_text(size = "x-small"), 
    locations = cells_body(columns = 2, rows = c(2, 4, 6))
  ) |> 
  tab_options(table.width = pct(100), table.font.size = 12)

Multilevel Regression of Avocado Sales
	Total Volume
Population Effects
(Intercept)	732844.18
	[564203.10, 951892.30]
Average Price	0.52
	[0.51, 0.53]
Organic	0.04
	[0.04, 0.04]
Group Effects
Region Intercept Variance	2.46
Residual Variance	1.57

Num.Obs.	15545
ICC	0.8
RMSE	0.45
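Because the outcome is log-transformed and the table is built with `exponentiate = TRUE`, the population effects read as multiplicative ratios rather than additive changes. The sketch below converts them to percentage changes. The `pct_change` helper is hypothetical and written only for this example. The second half is an inference from the reported numbers, not something the slides state: the group-effect rows look like exponentiated standard deviations rather than raw variances, since taking logs recovers the reported RMSE and ICC.

```r
# Convert an exponentiated coefficient (a ratio) to a percentage change.
pct_change <- function(ratio) (ratio - 1) * 100

pct_change(0.52) # Average Price: about -48% volume per one-unit price increase
pct_change(0.04) # Organic: about -96% volume relative to conventional

# Assumption: the group-effect rows are exp(standard deviation).
sd_region <- log(2.46)
sd_residual <- log(1.57)
implied_icc <- sd_region^2 / (sd_region^2 + sd_residual^2)

round(sd_residual, 2) # 0.45, in line with the reported RMSE
round(implied_icc, 2) # 0.8, in line with the reported ICC
```

If that reading is right, the region-level standard deviation on the log scale is about 0.90, roughly twice the residual standard deviation.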

Varying Intercepts

varying_intercepts |> 
  ggeffects::predict_response(c("average_price", "organic")) |> 
  tibble() |> 
  mutate(type = if_else(group == 0, "Conventional", "Organic")) |> 
  ggplot(aes(x, predicted, group = type, colour = type)) +
  geom_point(size = 1.5) +
  geom_line(linewidth = 1) +
  geom_line(aes(y = conf.low), linetype = 2) +
  geom_line(aes(y = conf.high), linetype = 2) +
  scale_x_continuous(labels = label_currency()) +
  scale_y_continuous(
    labels = label_number(scale_cut = cut_short_scale())
    ) +
  labs(x = "Average Price", y = "Total Volume") +
  scwplot::scale_colour_qualitative(palette = "scw")

Varying Intercepts

varying_intercepts |> 
  ggeffects::predict_response(
    c("average_price", "organic", "region"), type = "random"
    ) |>
  tibble() |> 
  filter(facet %in% c("Houston", "Seattle", "Syracuse")) |> 
  mutate(type = if_else(group == 0, "Conventional", "Organic")) |>
  ggplot(aes(x, predicted, group = type, colour = type)) +
  geom_point(size = 1.5) +
  geom_line(linewidth = 1) +
  geom_line(aes(y = conf.low), linetype = 2) +
  geom_line(aes(y = conf.high), linetype = 2) +
  facet_wrap(facets = vars(facet), nrow = 3) +
  scale_x_continuous(labels = label_currency()) +
  scale_y_continuous(
    labels = label_number(scale_cut = cut_short_scale())
    ) +
  labs(x = "Average Price", y = "Total Volume") +
  scwplot::scale_colour_qualitative(palette = "scw")

Conclusion

Multilevel data is very common, and we should start by assuming our data has grouping structures in it.
Not accounting for those grouping structures can cause a lot of problems.
Multilevel regression is flexible, powerful, and incredibly effective!

Thank You!

Contact:

Code & Slides:

/NHS-South-Central-and-West/handling-missing-values

References

Gelman, Andrew. 2006. “Multilevel (Hierarchical) Modeling: What It Can and Cannot Do.” Technometrics 48 (3): 432–35. http://www.stat.columbia.edu/~gelman/research/published/multi2.pdf.

McElreath, Richard. 2017. “Multilevel Regression as Default.” https://elevanth.org/blog/2017/08/24/multilevel-regression-as-default/.

McElreath, Richard. 2018. Statistical Rethinking: A Bayesian Course with Examples in R and Stan. Chapman and Hall/CRC.

Thieu, Monica. 2020. “Wrangling Multilevel Data in the Tidyverse.” https://www.monicathieu.com/posts/2020-04-08-tidy-multilevel.