Simulation-Based Hypothesis Testing

There Really is Only One Test

Paul Johnson

Hypothesis Testing

Issues of Significance

What is a Hypothesis?

A falsifiable statement about the world that forms the basis for scientific enquiry.
A hypothesis posits the effect we expect to observe in our sample data, in order to make generalisable statements about the effect in the population (Blackwell 2023).
Hypotheses are derived from theories – what we should expect to observe in our data, given the theory.
Quantitative research tests hypotheses, and each result brings us a step closer to drawing inferences about the world.

Hypothesis Testing

Hypothesis tests measure the compatibility of the observed data with what we should observe if the hypothesis (and all other assumptions of the test) is true.
A hypothesis test quantifies how surprising the observed sample would be if chance alone were at work, which informs (but does not settle) whether the effect generalises to the population.
The Null Hypothesis Significance Testing (NHST) framework is the most common approach to testing hypotheses.
- Null Hypothesis (\(H_0\)) = No effect in the population
- Alternative Hypothesis (\(H_1\)) = The effect in the population is not equal to zero
A hypothesis test in NHST seeks to reject the null, which provides support for (but does not confirm) the alternative hypothesis.
NHST is controversial, but it is pervasive across science.

A Common Testing Framework

Set the test (often null) hypothesis.
Generate the test distribution – the data distribution we should expect to observe if the test hypothesis is true (and all other assumptions are met).
Compute the test statistic – quantifying how far the observed data sit from what the test distribution predicts.
Compute the p-value – the probability of observing a test statistic at least as extreme as the one observed if the test hypothesis is true.
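
The four steps can be sketched in base R with a deliberately simple case where the test distribution is known exactly. The coin-flip data (60 heads in 100 flips) and all variable names are invented for illustration and are not part of the original slides.

```r
# Hypothetical data: 60 heads in 100 flips (invented for illustration)
n_flips <- 100
observed_heads <- 60

# 1. Set the test (null) hypothesis: the coin is fair
p_null <- 0.5

# 2. Generate the test distribution: the probability of every possible
#    head count if the null hypothesis is true
head_counts <- 0:n_flips
null_probs <- dbinom(head_counts, size = n_flips, prob = p_null)

# 3. Compute the test statistic: distance between the observed head count
#    and the count expected under the null
expected_heads <- n_flips * p_null
observed_stat <- abs(observed_heads - expected_heads)

# 4. Compute the p-value: total probability of outcomes at least as far
#    from the expected count as the one observed
p_value <- sum(null_probs[abs(head_counts - expected_heads) >= observed_stat])
p_value
```

Under these assumptions the result should agree with `binom.test(60, 100)`, which packages the same four steps into a single call.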

One-Sample T-Test

library(tibble)   # tibble()
library(ggplot2)  # ggplot(), geom_density(), labs()

set.seed(42)

score_distributions <- 
  tibble(
    observed = rnorm(50, mean = 105, sd = 15),
    null = rnorm(50, mean = 100, sd = 15)
    ) |> 
  tidyr::pivot_longer(
    cols = everything(), names_to = "dist", values_to = "value"
    )

score_distributions |> 
  ggplot(aes(value, fill = dist)) +
  geom_density(alpha = 0.6) +
  labs(
    title = "IQ Scores Observed & Null Distributions (n = 50)",
    x = "IQ Score", y = NULL
    ) +
  scwplot::scale_fill_qualitative(palette = "scw")

One-Sample T-Test

library(tibble)  # tibble()
library(dplyr)   # mutate(), across(), select()
library(infer)   # t_test()

set.seed(42)

smaller_sample <- 
  tibble(iq_score = rnorm(50, mean = 105, sd = 15))


t_test(
  smaller_sample, response = iq_score, 
  mu = 100, alternative = "two-sided"
  ) |> 
  mutate(
    across(where(is.numeric), ~round(.x, 2))
    ) |>
  select(statistic, p_value, estimate, lower_ci, upper_ci) |> 
  gt::gt()

IQ Scores T-Test (n = 50)
statistic	p_value	estimate	lower_ci	upper_ci
1.83	0.07	104.46	99.56	109.37
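
To see what sits behind that table, the same quantities can be computed by hand in base R. This is a sketch that assumes the same seed and R's default random number generator, and it uses the standard one-sample formula t = (sample mean - null mean) / (sd / sqrt(n)). The variable names are illustrative.

```r
set.seed(42)
iq_score <- rnorm(50, mean = 105, sd = 15)

mu_null <- 100
n <- length(iq_score)

# Standard error of the sample mean
std_error <- sd(iq_score) / sqrt(n)

# Test statistic: how many standard errors the sample mean is from the null
t_stat <- (mean(iq_score) - mu_null) / std_error

# Two-sided p-value from the t distribution with n - 1 degrees of freedom
p_value <- 2 * pt(abs(t_stat), df = n - 1, lower.tail = FALSE)

# 95% confidence interval for the mean
ci <- mean(iq_score) + c(-1, 1) * qt(0.975, df = n - 1) * std_error

# Cross-check against the built-in test
builtin <- t.test(iq_score, mu = mu_null)
all.equal(t_stat, unname(builtin$statistic))
all.equal(p_value, builtin$p.value)
```

The test distribution here is the theoretical t distribution, so no simulation is needed, but the steps are the same as in the common testing framework.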

One-Sample T-Test

set.seed(42)

score_distributions <- 
  tibble(
    observed = rnorm(75, mean = 105, sd = 15),
    null = rnorm(75, mean = 100, sd = 15)
    ) |> 
  tidyr::pivot_longer(
    cols = everything(), names_to = "dist", values_to = "value"
    )

score_distributions |> 
  ggplot(aes(value, fill = dist)) +
  geom_density(alpha = 0.6) +
  labs(
    title = "IQ Scores Observed & Null Distributions (n = 75)",
    x = "IQ Score", y = NULL
    ) +
  scwplot::scale_fill_qualitative(palette = "scw")

One-Sample T-Test

set.seed(42)

larger_sample <- 
  tibble(iq_score = rnorm(75, mean = 105, sd = 15))

t_test(
  larger_sample, response = iq_score, 
  mu = 100, alternative = "two-sided"
  ) |> 
  mutate(
    across(where(is.numeric), ~round(.x, 2))
    ) |>
  select(statistic, p_value, estimate, lower_ci, upper_ci) |> 
  gt::gt()

IQ Scores T-Test (n = 75)
statistic	p_value	estimate	lower_ci	upper_ci
2.83	0.01	105.36	101.59	109.12
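
The two results differ even though the underlying effect (a true mean of 105 against a null of 100) is identical. One way to explore this is to repeat the experiment many times at each sample size and count how often p falls below 0.05. The sketch below assumes a 0.05 threshold and 2,000 repetitions; the function name `rejection_rate` and the seed are illustrative choices.

```r
set.seed(2024)

# Share of simulated experiments in which a one-sample t-test rejects the null
rejection_rate <- function(n, reps = 2000, true_mean = 105, mu_null = 100,
                           sd = 15, alpha = 0.05) {
  p_values <- replicate(
    reps,
    t.test(rnorm(n, mean = true_mean, sd = sd), mu = mu_null)$p.value
  )
  mean(p_values < alpha)
}

rate_50 <- rejection_rate(50)
rate_75 <- rejection_rate(75)

c(n_50 = rate_50, n_75 = rate_75)
```

Under these assumptions the larger sample should reject the null noticeably more often, which is why a single significant or non-significant result says as much about the sample size as about the effect.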

A Test for Every Eventuality

T-tests (one-sample, paired, two-sample)
Chi-squared tests
ANOVA
Mann-Whitney U test
Wilcoxon signed-rank test
Fisher's exact test
McNemar's test
Kruskal-Wallis test
And probably thousands more…

THERE MUST BE A BETTER WAY

Simulation-Based Hypothesis Tests

Simulated Testing Framework

All hypothesis tests are trying to do the same thing – compare the observed data against a test distribution.
We can leverage this and, instead, simulate the data distribution that our test hypothesis should produce.
We need only a test statistic (a measure of the size of the effect, such as the absolute difference in means), a test/null hypothesis together with a model for generating a distribution from it, and a method for computing the p-value (Downey 2016).
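
A minimal base R sketch of this framework follows. The function `simulation_test` and its argument names are hypothetical, written for illustration rather than taken from any package, and the null model (normal scores with the sample's standard deviation) is one assumption among several that could be made.

```r
set.seed(123)

# Generic engine: every test is this function with different plug-ins
simulation_test <- function(data, test_statistic, simulate_null, reps = 1000) {
  # Test statistic for the observed data
  observed <- test_statistic(data)
  # The same statistic for datasets generated under the test hypothesis
  simulated <- replicate(reps, test_statistic(simulate_null(data)))
  list(
    observed = observed,
    simulated = simulated,
    # p-value: share of simulated statistics at least as large as the observed
    p_value = mean(simulated >= observed)
  )
}

# Hypothetical sample of 50 IQ scores
iq_score <- rnorm(50, mean = 105, sd = 15)

result <- simulation_test(
  data = iq_score,
  # Test statistic: absolute distance of the sample mean from 100
  test_statistic = function(x) abs(mean(x) - 100),
  # Null model: normal scores centred on 100 with the sample's spread
  simulate_null = function(x) rnorm(length(x), mean = 100, sd = sd(x))
)

result$p_value
```

Changing the test means changing the two functions passed in, not the engine.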

Simulating a T-Test

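# Observed statistic: the sample mean (named obs_mean to avoid masking base::t())
obs_mean <- 
  smaller_sample |> 
  specify(response = iq_score) |> 
  calculate(stat = "mean")

# Null distribution: 1,000 bootstrap means under the point null mu = 100
null <- 
  smaller_sample |>
  specify(response = iq_score) |> 
  hypothesize(null = "point", mu = 100) |> 
  generate(reps = 1000, type = "bootstrap") |> 
  calculate(stat = "mean")

p <- null |> get_p_value(obs_stat = obs_mean, direction = "two-sided")

# Plot the null distribution and shade the region at least as extreme as obs_mean
null |>
  visualize() + 
  shade_p_value(obs_mean, direction = "two-sided") +
  annotate(
    "text", x = 95, y = 125, 
    label = paste0(
      "mean = ", round(obs_mean$stat, 2), "\n p = ", round(p$p_value, 2)
      ),
    size = rel(6), color = "grey30"
    ) +
  labs(x = "IQ Score", y = NULL, title = NULL)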

Simulating a T-Test

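# Observed statistic: the sample mean (named obs_mean to avoid masking base::t())
obs_mean <- 
  larger_sample |> 
  specify(response = iq_score) |> 
  calculate(stat = "mean")

# Null distribution: 1,000 bootstrap means under the point null mu = 100
null <- 
  larger_sample |>
  specify(response = iq_score) |> 
  hypothesize(null = "point", mu = 100) |> 
  generate(reps = 1000, type = "bootstrap") |> 
  calculate(stat = "mean")

p <- null |> get_p_value(obs_stat = obs_mean, direction = "two-sided")

# Plot the null distribution and shade the region at least as extreme as obs_mean
null |>
  visualize() + 
  shade_p_value(obs_mean, direction = "two-sided") +
  annotate(
    "text", x = 95, y = 125, 
    label = paste0(
      "mean = ", round(obs_mean$stat, 2), "\n p = ", round(p$p_value, 2)
      ),
    size = rel(6), color = "grey30"
    ) +
  labs(x = "IQ Score", y = NULL, title = NULL)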
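
For intuition, the pipeline can be approximated in base R. This sketch assumes that the point-null bootstrap works by recentring the sample on the hypothesised mean before resampling, which is my understanding of the approach rather than a statement of the package's internals. The p-value below uses a simple "at least as far from 100" rule, so it may differ slightly from the value the package reports.

```r
set.seed(42)
iq_score <- rnorm(50, mean = 105, sd = 15)

mu_null <- 100
observed_mean <- mean(iq_score)

# Recentre the sample so that the null hypothesis is true for it
shifted <- iq_score - observed_mean + mu_null

# Resample with replacement and record the mean of each bootstrap sample
null_means <- replicate(1000, mean(sample(shifted, replace = TRUE)))

# Two-sided p-value: share of bootstrap means at least as far from the
# null mean as the observed sample mean
p_value <- mean(abs(null_means - mu_null) >= abs(observed_mean - mu_null))
p_value
```

The whole test is a shift, a resampling loop, and a proportion.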

Advantages of the Simulation Approach

There is only one test!
This approach is transparent, quick, and flexible.
Building tests from simulations is an excellent way to gain an intuition for how hypothesis testing works.
Learning to use simulations for checking assumptions and considering the implications of your model is good practice.
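
As a sketch of that flexibility, the same recipe handles a two-group comparison with a statistic that has no convenient textbook test: the absolute difference in medians. The groups below are simulated for illustration, and shuffling the group labels is one common way to generate data under a null of no difference.

```r
set.seed(7)

# Hypothetical scores for two groups (invented for illustration)
group_a <- rnorm(40, mean = 100, sd = 15)
group_b <- rnorm(40, mean = 108, sd = 15)

values <- c(group_a, group_b)
labels <- rep(c("a", "b"), times = c(length(group_a), length(group_b)))

# Test statistic: absolute difference in group medians
diff_in_medians <- function(values, labels) {
  abs(median(values[labels == "b"]) - median(values[labels == "a"]))
}

observed <- diff_in_medians(values, labels)

# Null model: if group membership is irrelevant, shuffling the labels
# produces datasets that are as plausible as the observed one
null_stats <- replicate(2000, diff_in_medians(values, sample(labels)))

# p-value: share of shuffled datasets with a difference at least as large
p_value <- mean(null_stats >= observed)
p_value
```

Swapping `median` for `mean`, or for any other summary, changes the test without changing the procedure.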

Conclusion

There is an endless supply of statistical tests for every test statistic, data distribution, and method for calculating the p-value.
Most of these statistical tests are just linear models anyway.
They are all doing the same thing. There is only one test.
Use simulation-based hypothesis testing and never have to learn what a Wilcoxon signed rank test actually is.
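
The linear model claim can be checked for the one-sample case. In this sketch, assuming the same simulated IQ scores as earlier, an intercept-only regression on the scores minus 100 should reproduce the one-sample t-test.

```r
set.seed(42)
iq_score <- rnorm(50, mean = 105, sd = 15)

# Classical one-sample t-test against a null mean of 100
classic <- t.test(iq_score, mu = 100)

# Intercept-only linear model on the scores shifted by the null mean:
# the intercept estimates (mean - 100) and its t value tests it against zero
model <- lm(I(iq_score - 100) ~ 1)
coefs <- summary(model)$coefficients

all.equal(unname(classic$statistic), coefs["(Intercept)", "t value"])
all.equal(classic$p.value, coefs["(Intercept)", "Pr(>|t|)"])
```

Both comparisons should return `TRUE`, since the two procedures compute the same statistic from the same standard error.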

Thank You!

Code & Slides:

/NHS-South-Central-and-West/simulation-based-tests

References

Blackwell, Matthew. 2023. “A User’s Guide to Statistical Inference and Regression.” https://mattblackwell.github.io/gov2002-book/.

Downey, Allen. 2016. “There Is Still Only One Test.” Probably Overthinking It. https://allendowney.substack.com/p/there-is-still-only-one-test.