Handling Missing Data

An Introduction to Missing Data Concepts & Methods of Resolution

Paul Johnson

What is Missing Data?

Understanding Types of Missing Data & Why it Matters

Sources of Missingness

  • There are three types of missing data (Rubin 1987):
    • Missing Completely at Random (MCAR) - Randomly missing, no relationship with other variables in the data, values randomly distributed
    • Missing at Random (MAR) - Values randomly distributed, but there is a relationship with other variables in the data
    • Missing Not at Random (MNAR) - Missingness is related to other variables and missing values are not random
  • The type and volume of missing data can seriously impact a variable’s distribution, the results of any analysis carried out on that data, and the validity of any solution.

Distribution Without Missing Values

plot_satisfaction <-
  function(data) {
    data |>
      ggplot(aes(x = years_at_company, fill = job_satisfaction)) +
      geom_histogram(colour = "grey20") +
      scwplot::scale_fill_sequential(
        palette = "blues",
        discrete = TRUE
      ) +
      labs(
        x = "Years at Company", y = NULL,
        title = "Employee Tenure by Job Satisfaction"
      )
  }

attrition |>
  plot_satisfaction()

Missing Completely at Random (MCAR)

set.seed(123)

attrition |>
  mutate(
    years_at_company = replace(
      years_at_company,
      runif(n()) < 0.5, NA
    )
  ) |>
  plot_satisfaction()

Missing at Random (MAR)

set.seed(123)

attrition |>
  mutate(
    years_at_company =
      replace(
        years_at_company,
        runif(n()) < 0.8 & job_satisfaction == "Low", NA
      )
  ) |>
  plot_satisfaction()

Missing Not at Random (MNAR)

attrition |>
  mutate(
    years_at_company =
      replace(
        years_at_company,
        years_at_company < 10 & 
          job_satisfaction == "Low", NA
      )
  ) |>
  plot_satisfaction()

Identifying Missingness Type

  • In practice, identifying missingness type is very difficult, requiring exploratory analysis, domain-expertise, and good judgement.
  • The {mice} and {ggmice} R packages offer functions for investigating missingess visually.
  • It is also possible to model missing values using logistic regression, treating missingness as the binary outcome.
  • However, none of these methods will give a definitive answer.
  • The starting assumption should always be that data is MNAR (Errickson 2017).

Dealing With Missing Data

How to Handle Missing Data in an Analysis

Common Approaches

  • The best solution for missing data is to find the data, whether by data collection or processing, or theory-driven inference.
  • When this is not possible there are two broad approaches to dealing with missing values:
    • Deletion - Listwise Deletion, Pairwise Deletion
    • Replacement - Mean/Median/Mode Imputation, Multiple Imputation, Regression Imputation
  • The right approach is highly dependent on the nature of the missing data.
  • Dealing with missing data requires understanding why it is missing first!

Deletion Methods

  • Listwise Deletion
    • Removing any observations (rows) that contain missing values for any relevant variables.
    • Analysis carried out on complete cases only.
  • Pairwise deletion
    • Removing any missing values, but not the entire observation (row).
    • Means and covariances are calculated on all observed, and these can be used to build statistical models (Van Buuren 2018).
  • When data is MCAR and the volumes of data that are missing is not an issue, listwise deletion is suitable.
  • When data is MCAR but volumes of data are a concern, pairwise deletion may be a better choice.
  • When data is MAR, deletion methods may still be valid under weaker assumptions (Errickson 2017).

Imputation Methods

  • Imputation involves replacing missing values with values inferred from the rest of the data.
  • There are two types of imputation:
    • Single Imputation - Replacing missing values with a single value, estimated from some statistical procedure.
    • Multiple Imputation - Creating multiple datasets each replacing missing values with plausible estimated values and pooling estimates from analyses carried out on each dataset.
  • Imputation methods can be valid whether data is MCAR, MAR, or MNAR.
    • However, single imputation is flawed and underestimates uncertainty.

Simple Imputation

  • There are lots of simple procedures for imputing missing values, usually involving replacing all missing values with the variable’s average value (usually the mean, but median and mode can be used too).
  • Another common method is to impute a constant value in place of missing values (such as 99, as often seen in survey data).
  • These methods are almost always a bad idea, because they have many flaws:
    • Distorts the variable’s distribution, underestimating it’s variance (Van Buuren 2018).
    • Disrupts the relationship between the variable with imputed values and all other variables (Nguyen 2020).
    • Biases model estimates (Van Buuren 2018).
  • It is important to know that these simple imputation methods exist, but the circumstances where using them is valid are very few!

Regression Without Imputation

plot_regression <-
  function(data) {
    data |>
      ggplot(aes(x = total_working_years, y = monthly_income)) +
      geom_point(shape = 21, size = 1.5, alpha = .5) +
      geom_smooth(
        method = lm, colour = "#005EB8",
        fill = "#005EB8", alpha = .5
      ) +
      labs(
        x = "Total Working Years", y = "Monthly Income",
        title = "Monthly Income ~ Total Length of Career"
      )
  }

attrition |>
  plot_regression()

Mean Imputation

set.seed(123)

missing_years <-
  attrition |>
  mutate(
    total_working_years =
      replace(
        total_working_years,
        runif(n()) < 0.8 &
          (job_level <= 2 | age > 35), NA
      )
  )

missing_years |>
  mice::mice(
    method = "mean", m = 1,
    maxit = 1, print = FALSE
  ) |>
  mice::complete() |>
  plot_regression()

Regression Imputation

  • A more robust approach is to estimate a regression model that predicts missing values.
  • Regression imputation could take the form of any regression model, depending on the type of data being imputed and how complex the model should be.
    • As always, linear and logistic regression are the most common approaches.
    • Modern machine learning imputation methods, using tree- and neighbour-based algorithms, can be understood as regression imputation, as they are also fitting predictive models.
  • However, Van Buuren (2018) argues regression imputation is the “most dangerous” imputation method, because it give false confidence in outputs.

Regression Imputation

missing_years |> 
  mice::mice(
    method = "norm.predict", m = 1,
    maxit = 1, print = FALSE
  ) |>
  mice::complete() |>
  plot_regression()

Stochastic Regression Imputation

missing_years |> 
  mice::mice(
    method = "norm.nob", m = 1, maxit = 1, 
    print = FALSE, seed = 123
  ) |>
  mice::complete() |>
  plot_regression()

The Problem with Single Imputation

Imputing one value for a missing datum cannot be correct in general, because we don’t know what value to impute with certainty (if we did, it wouldn’t be missing) (Rubin 1987).

Multiple Imputation

  • Multiple imputation involves generating multiple datasets, performing analysis on each, and pooling the results. This is a two-stage process:
    1. Generate multiple completed datasets, filling missing values using a statistical model that estimates imputation values, plus a random component to capture the uncertainty in the estimate.
    2. Compute estimates on each completed dataset before combining them as pooled estimates and standard errors, using Rubin (1987)’s formula (Murray 2018).
  • The methods used for each stage may differ, but this two-stage approach is generally consistent across all forms of multiple imputation.
  • This approach acknowledges the uncertainty in the imputation of missing values, and bakes that uncertainty into the process, instead of treating imputed values with equal weight/certainty as non-missing values.

Multiple Imputation

missing_years |> 
  mice::mice(
    method = "pmm", m = 30, maxit = 10, 
    print = FALSE, seed = 123
  ) |>
  mice::complete() |>
  plot_regression()

Regression Setup

set.seed(123)

missing_income <-
  attrition |>
  mutate(monthly_income = replace(monthly_income, runif(n()) < 0.8 & (job_level <= 2 | total_working_years > 10), NA),
         total_working_years = replace(total_working_years, runif(n()) < 0.5 & job_level >= 3, NA))

get_pooled_estimates <-
  function(data, method, m, maxit) {
    data |>
      mice::mice(method = method, m = m, maxit = maxit, print = FALSE, seed = 123) |>
      with(glm(factor(attrition) ~ arm::rescale(monthly_income) + total_working_years, family = "binomial")) |> 
      mice::pool()
  }

no_imp <- glm(factor(attrition) ~ arm::rescale(monthly_income) + total_working_years, family = "binomial", data = attrition)
mean_imp <- missing_income |> get_pooled_estimates(method = "mean", m = 1, maxit = 1)
norm_imp <- missing_income |> get_pooled_estimates(method = "norm.predict", m = 1, maxit = 1)
pmm_imp <- missing_income |> get_pooled_estimates(method = "pmm", m = 50, maxit = 20)

Regression Estimates

models <-
  list("No Imputation" = no_imp,
       "Mean" = mean_imp,
       "Regression" = norm_imp,
       "Predictive Mean Matching" = pmm_imp)

cm <- c("(Intercept)" = "(Intercept)",
        "arm::rescale(monthly_income)" = "Monthly Income",
        "total_working_years" = "Total Working Years")

modelsummary::modelsummary(
  models, exponentiate = TRUE, output = "gt",
  coef_map = cm, gof_omit = "IC|Log|F|RMSE",
  title = "Logstic Regressions of Job Attrition"
  ) |>
  gt::tab_spanner(label = "Single Imputation", columns = 3:4) |> 
  gt::tab_spanner(label = "Multiple Imputation", columns = 5)
Logstic Regressions of Job Attrition
No Imputation Single Imputation Multiple Imputation
Mean Regression Predictive Mean Matching
(Intercept) 0.301 0.423 0.268 0.303
(0.058) (0.062) (0.056) (0.072)
Monthly Income 0.550 0.865 0.444 0.564
(0.154) (0.142) (0.134) (0.227)
Total Working Years 0.950 0.913 0.959 0.949
(0.016) (0.014) (0.017) (0.019)
Num.Obs. 1470 1470 1470 1470
Num.Imp. 50

Other Imputation Methods

  • Machine learning imputation - including k-nearest neighbours (KNN), classification and regression trees (CART), and random forests.
    • Some research suggests they can produce accurate imputations, but when used in a single imputation process they still risk overconfidence and underestimation of uncertainty.
    • When a single complete dataset is necessary, and multiple imputation is not possible, machine learning imputation may be appropriate.
    • KNN has proven to be both popular and accurate as an imputation method in machine learning.
  • Multilevel multiple imputation - when there is a clustering structure underpinning the missing data, multilevel models can be effective.
  • Time-series imputation - including last-observation carried forward (LOCF), moving averages, interpolation, Kalman smoothing.

Conclusion

  • Not dealing with missing values is a methodological choice in and of itself.
    • Statistical tools will deal with missing values (usually listwise deletion), and this has consequences.
    • Those consequences might be acceptable, but that decision should be made explicitly.
  • How missing values should be handled depends on the nature, and potentially the volume, of the missingness.
  • Single imputation methods are quick and easy but are often extremely flawed, because they underestimate uncertainty.
  • The best method for dealing with missing values is to find them, but when that is not possible, multiple imputation can be a powerful solution.

Further Resources

Thank You!

Contact:

Code & Slides:

References

Errickson, Josh. 2017. “Statistics 701: Special Topics in Applied Statistics II - Multiple Imputation.” University of Michigan. https://dept.stat.lsa.umich.edu/~jerrick/courses/stat701/notes/mi.html.
Murray, Jared S. 2018. “Multiple Imputation: A Review of Practical and Theoretical Findings.” https://arxiv.org/pdf/1801.04058.pdf.
Nguyen, Mike. 2020. A Guide on Data Analysis. Bookdown. https://bookdown.org/mike/data_analysis/.
Rubin, Donald B. 1987. Multiple Imputation for Nonresponse in Surveys. Wiley.
Van Buuren, Stef. 2018. Flexible Imputation of Missing Data. CRC Press. https://stefvanbuuren.name/fimd/.