Hypothesis Testing

Issues of Significance

What is a Hypothesis?

  • A falsifiable statement about the world that forms the basis for scientific enquiry.
  • A hypothesis posits the effect we expect to observe in our sample data, in order to make generalisable statements about the effect in the population (Blackwell 2023).
  • Hypotheses are derived from theories – what we should expect to observe in our data, given the theory.
  • Quantitative research seeks to test hypotheses, and the results are a step closer to drawing inferences about the world.

Hypothesis Testing

  • Hypothesis tests measure the compatibility of the observed data with what we should observe if the hypothesis (and all other assumptions of the test) are true.
  • A hypothesis test quantifies how surprising the observed sample would be if the test hypothesis were true and only chance were at work, which informs how confident we can be in generalising the effect to the population.
  • The Null Hypothesis Significance Testing (NHST) framework is the most common approach to testing hypotheses.
    • Null Hypothesis (\(H_0\)) = No effect in the population
    • Alternative Hypothesis (\(H_1\)) = The effect in the population is not equal to zero
  • A hypothesis test in NHST seeks to reject the null, which provides support for (but does not confirm) the alternative hypothesis.
  • NHST is controversial, but it is pervasive across science.

A Common Testing Framework

  1. Set the test (often null) hypothesis.
  2. Generate the test distribution – the data distribution we should expect to observe if the test hypothesis is true (and all other assumptions are met).
  3. Compute the test statistic – quantifying how extreme the observed data are relative to the test distribution.
  4. Compute the p-value – the probability of observing a test statistic at least as extreme as the one observed if the test hypothesis is true.
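The four steps above can be sketched by hand in base R for a one-sample t-test. This is a minimal illustration, not the slides' own code: the sample is simulated in the same way as the IQ examples in this deck, and the names `mu_null`, `std_error`, `t_stat`, and `p_value` are chosen here for readability.

```r
# Hypothetical sample, simulated like the IQ examples in these slides.
set.seed(42)
iq_score <- rnorm(50, mean = 105, sd = 15)

# 1. Set the test (null) hypothesis: the population mean is 100.
mu_null <- 100

# 2. Test distribution: if the null is true (and the t-test's assumptions hold),
#    the standardised sample mean follows a t distribution with n - 1 degrees of freedom.
n <- length(iq_score)
df <- n - 1

# 3. Test statistic: distance between the sample mean and the null value,
#    measured in standard errors.
std_error <- sd(iq_score) / sqrt(n)
t_stat <- (mean(iq_score) - mu_null) / std_error

# 4. p-value: probability of a statistic at least this extreme, in either tail.
p_value <- 2 * pt(abs(t_stat), df = df, lower.tail = FALSE)

# Cross-check against base R's built-in test.
builtin <- t.test(iq_score, mu = mu_null, alternative = "two.sided")
print(c(t_stat = t_stat, p_value = p_value))
```

The hand-computed values should agree with `t.test()`, which suggests that the built-in test is doing nothing more than these four steps.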

One-Sample T-Test

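library(tidyverse) # tibble, tidyr, dplyr, ggplot2
library(infer)     # t_test() and the simulation verbs used later

set.seed(42)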

score_distributions <- 
  tibble(
    observed = rnorm(50, mean = 105, sd = 15),
    null = rnorm(50, mean = 100, sd = 15)
    ) |> 
  tidyr::pivot_longer(
    cols = everything(), names_to = "dist", values_to = "value"
    )

score_distributions |> 
  ggplot(aes(value, fill = dist)) +
  geom_density(alpha = 0.6) +
  labs(
    title = "IQ Scores Observed & Null Distributions (n = 50)",
    x = "IQ Score", y = NULL
    ) +
  scwplot::scale_fill_qualitative(palette = "scw")

One-Sample T-Test

set.seed(42)

smaller_sample <- 
  tibble(iq_score = rnorm(50, mean = 105, sd = 15))


t_test(
  smaller_sample, response = iq_score, 
  mu = 100, alternative = "two-sided"
  ) |> 
  mutate(
    across(where(is.numeric), ~round(.x, 2))
    ) |>
  select(statistic, p_value, estimate, lower_ci, upper_ci) |> 
  gt::gt()

IQ Scores T-Test (n = 50)

statistic  p_value  estimate  lower_ci  upper_ci
1.83       0.07     104.46    99.56     109.37

One-Sample T-Test

set.seed(42)

score_distributions <- 
  tibble(
    observed = rnorm(75, mean = 105, sd = 15),
    null = rnorm(75, mean = 100, sd = 15)
    ) |> 
  tidyr::pivot_longer(
    cols = everything(), names_to = "dist", values_to = "value"
    )

score_distributions |> 
  ggplot(aes(value, fill = dist)) +
  geom_density(alpha = 0.6) +
  labs(
    title = "IQ Scores Observed & Null Distributions (n = 75)",
    x = "IQ Score", y = NULL
    ) +
  scwplot::scale_fill_qualitative(palette = "scw")

One-Sample T-Test

set.seed(42)

larger_sample <- 
  tibble(iq_score = rnorm(75, mean = 105, sd = 15))

t_test(
  larger_sample, response = iq_score, 
  mu = 100, alternative = "two-sided"
  ) |> 
  mutate(
    across(where(is.numeric), ~round(.x, 2))
    ) |>
  select(statistic, p_value, estimate, lower_ci, upper_ci) |> 
  gt::gt()

IQ Scores T-Test (n = 75)

statistic  p_value  estimate  lower_ci  upper_ci
2.83       0.01     105.36    101.59    109.12
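One way to read the two results side by side is through the one-sample t statistic itself. The formula is the standard definition. The standard errors shown are back-calculated from the rounded values in the two tables above, so they are approximate.

```latex
t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}}
\qquad
\frac{s}{\sqrt{n}} \approx \frac{104.46 - 100}{1.83} \approx 2.44 \quad (n = 50)
\qquad
\frac{s}{\sqrt{n}} \approx \frac{105.36 - 100}{2.83} \approx 1.89 \quad (n = 75)
```

The larger sample has both a slightly larger estimated effect and a smaller standard error, and together these move the statistic from 1.83 to 2.83.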

A Test for Every Eventuality

  • T-tests (one-sample, paired, two-sample)
  • Chi-squared tests
  • ANOVA
  • Mann-Whitney U test
  • Wilcoxon signed rank test
  • Fisher exact test
  • McNemar test
  • Kruskal-Wallis test
  • And probably thousands more…

THERE MUST BE A BETTER WAY

Simulation-Based Hypothesis Tests

Simulated Testing Framework

  • All hypothesis tests are trying to do the same thing – compare the observed data against a test distribution.
  • We can leverage this and, instead, simulate the data distribution that our test hypothesis should produce.
  • We just need three things: a test statistic (a measure of the size of the effect, such as the absolute difference in means), a test/null hypothesis with a model for generating a distribution from it, and a method for computing the p-value (Downey 2016).
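Those three ingredients can be written out in a few lines of base R. The sketch below assumes a bootstrap model of the null in which the sample is shifted so that its mean equals the null value; the sample and all variable names are illustrative. The two-sided p-value uses one common convention (the share of simulated means at least as far from the null value as the observed mean), and other conventions give slightly different numbers.

```r
set.seed(42)
iq_score <- rnorm(50, mean = 105, sd = 15)  # hypothetical sample
mu_null <- 100

# Ingredient 1: the test statistic, here the sample mean.
obs_mean <- mean(iq_score)

# Ingredient 2: a model of the null. Shift the sample so its mean equals
# mu_null while keeping its spread, then resample with replacement.
null_sample <- iq_score - obs_mean + mu_null
null_means <- replicate(1000, mean(sample(null_sample, replace = TRUE)))

# Ingredient 3: a p-value, the share of simulated means at least as far
# from mu_null as the observed mean.
p_value <- mean(abs(null_means - mu_null) >= abs(obs_mean - mu_null))

print(c(obs_mean = obs_mean, p_value = p_value))
```

Swapping the statistic (a median, say) or the null model changes individual lines, but the structure of the test stays the same.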

Simulating a T-Test

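# Observed statistic: the sample mean (not a t statistic)
obs_mean <- 
  smaller_sample |> 
  specify(response = iq_score) |> 
  calculate(stat = "mean")

# Null distribution: 1,000 bootstrap means under the point null mu = 100
null_dist <- 
  smaller_sample |>
  specify(response = iq_score) |> 
  hypothesize(null = "point", mu = 100) |> 
  generate(reps = 1000, type = "bootstrap") |> 
  calculate(stat = "mean")

p <- null_dist |> get_p_value(obs_stat = obs_mean, direction = "two-sided")

# Plot the null distribution and shade the region at least as extreme as the observed mean
null_dist |>
  visualize() + 
  shade_p_value(obs_mean, direction = "two-sided") +
  annotate(
    "text", x = 95, y = 125, 
    label = paste0("mean = ", round(obs_mean$stat, 2), "\n p = ", round(p$p_value, 2)),
    size = rel(6), color = "grey30"
    ) +
  labs(x = "IQ Score", y = NULL, title = NULL)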

Simulating a T-Test

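# Observed statistic: the sample mean (not a t statistic)
obs_mean <- 
  larger_sample |> 
  specify(response = iq_score) |> 
  calculate(stat = "mean")

# Null distribution: 1,000 bootstrap means under the point null mu = 100
null_dist <- 
  larger_sample |>
  specify(response = iq_score) |> 
  hypothesize(null = "point", mu = 100) |> 
  generate(reps = 1000, type = "bootstrap") |> 
  calculate(stat = "mean")

p <- null_dist |> get_p_value(obs_stat = obs_mean, direction = "two-sided")

# Plot the null distribution and shade the region at least as extreme as the observed mean
null_dist |>
  visualize() + 
  shade_p_value(obs_mean, direction = "two-sided") +
  annotate(
    "text", x = 95, y = 125, 
    label = paste0("mean = ", round(obs_mean$stat, 2), "\n p = ", round(p$p_value, 2)),
    size = rel(6), color = "grey30"
    ) +
  labs(x = "IQ Score", y = NULL, title = NULL)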

Advantages of the Simulation Approach

  • There is only one test!
  • This approach is transparent, quick, and flexible.
  • Building tests from simulations is an excellent way to gain an intuition for how hypothesis testing works.
  • Learning to use simulations for checking assumptions and considering the implications of your model is good practice.
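To illustrate the flexibility claimed above, the same recipe can be pointed at a different question with only the statistic and the null model changed. The sketch below is a permutation test for a difference in means between two groups, written in base R. The two groups are simulated, hypothetical data, and the helper `diff_in_means()` is defined here for illustration.

```r
set.seed(42)

# Hypothetical data: IQ scores for two groups of 40 people.
group_a <- rnorm(40, mean = 100, sd = 15)
group_b <- rnorm(40, mean = 108, sd = 15)

scores <- c(group_a, group_b)
labels <- rep(c("a", "b"), times = c(length(group_a), length(group_b)))

# Test statistic: difference in group means (b minus a).
diff_in_means <- function(values, groups) {
  mean(values[groups == "b"]) - mean(values[groups == "a"])
}
obs_diff <- diff_in_means(scores, labels)

# Null model: if group membership has no effect, the labels are exchangeable,
# so shuffling them simulates the distribution of the statistic under the null.
null_diffs <- replicate(2000, diff_in_means(scores, sample(labels)))

# p-value: share of shuffled differences at least as large in absolute value.
p_value <- mean(abs(null_diffs) >= abs(obs_diff))

print(c(obs_diff = obs_diff, p_value = p_value))
```

Nothing here required choosing a named test. The only decisions were what to measure and how to simulate a world in which the effect is absent.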

Conclusion

  • There is an endless supply of statistical tests for every test statistic, data distribution, and method for calculating the p-value.
  • Most of these statistical tests are just linear models anyway.
  • They are all doing the same thing. There is only one test.
  • Use simulation-based hypothesis testing and you never have to learn what a Wilcoxon signed rank test actually is.
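The claim that most of these tests are linear models can be checked for the simplest case. The sketch below, in base R with a simulated sample like the one used earlier in the deck, compares a one-sample t-test against an intercept-only linear model fitted to the scores minus the null value. It covers only this one test and is not a general proof.

```r
set.seed(42)
iq_score <- rnorm(50, mean = 105, sd = 15)  # hypothetical sample

# Classic one-sample t-test against a null mean of 100.
classic <- t.test(iq_score, mu = 100)

# The same test as a linear model: subtract the null value and fit an
# intercept only, so the intercept's t-test asks "is the mean shift zero?".
model <- lm(I(iq_score - 100) ~ 1)
intercept <- summary(model)$coefficients["(Intercept)", ]

print(c(t_test = unname(classic$statistic), lm = unname(intercept["t value"])))
print(c(t_test = classic$p.value, lm = unname(intercept["Pr(>|t|)"])))
```

Both rows should print identical pairs of values, since the two approaches compute the same statistic and p-value.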

References

Blackwell, Matthew. 2023. “A User’s Guide to Statistical Inference and Regression.” https://mattblackwell.github.io/gov2002-book/.
Downey, Allen. 2016. “There Is Still Only One Test.” Probably Overthinking It. https://allendowney.substack.com/p/there-is-still-only-one-test.