library(tidyverse)
library(broom)
library(patchwork)
options(digits = 3)
set.seed(1234)
theme_set(theme_minimal())
\[\newcommand{\E}{\mathrm{E}} \newcommand{\Var}{\mathrm{Var}} \newcommand{\Cov}{\mathrm{Cov}} \newcommand{\se}{\text{se}} \newcommand{\Lagr}{\mathcal{L}} \newcommand{\lagr}{\mathcal{l}}\]
In hypothesis testing, we start with some default theory – called a null hypothesis – and we ask if the data provide sufficient evidence to reject the theory. If not, we fail to reject the null hypothesis.
Formally, suppose we partition the parameter space \(\Theta\) into two disjoint sets \(\Theta_0\) and \(\Theta_1\) and that we wish to test
\[H_0: \theta \in \Theta_0 \quad \text{versus} \quad H_1: \theta \in \Theta_1\]
Let \(X\) be a random variable and let \(\chi\) be the range of \(X\). We test a hypothesis by finding an appropriate subset of outcomes \(R \subset \chi\) called the rejection region. If \(X \in R\) we reject the null hypothesis; otherwise we do not reject the null hypothesis. Usually the rejection region \(R\) is of the form
\[R = \left\{ x: T(x) > c \right\}\]
where \(T\) is a test statistic and \(c\) is a critical value. Hypothesis testing requires us to find an appropriate test statistic \(T\) and an appropriate critical value \(c\) to test a given hypothesis. Different hypotheses require different test statistics.
Hypothesis testing is not error-proof. We start from the assumption that \(H_0\) is true unless there is strong evidence to reject \(H_0\). Rejecting \(H_0\) when \(H_0\) is true is a type I error (false positive), while retaining \(H_0\) when \(H_1\) is true is called a type II error (false negative).
The power function of a test with rejection region \(R\) is defined by
\[\beta(\theta) = \Pr_\theta (X \in R)\]
The size of a test is defined to be
\[\alpha = \text{sup}_{\theta \in \Theta_0} \beta(\theta)\]
A test is said to have level \(\alpha\) if its size is less than or equal to \(\alpha\).
A test of the form
\[H_0: \theta = \theta_0 \quad \text{versus} \quad H_1: \theta \neq \theta_0\]
is called a two-sided test; its null hypothesis is a simple hypothesis (a single parameter value). A test of the form
\[H_0: \theta \leq \theta_0 \quad \text{versus} \quad H_1: \theta > \theta_0\]
or
\[H_0: \theta \geq \theta_0 \quad \text{versus} \quad H_1: \theta < \theta_0\]
is called a one-sided test; its null hypothesis is a composite hypothesis (a set of parameter values).
Let \(X_1, \ldots, X_n \sim N(\mu, \sigma^2)\) where \(\sigma\) is known. We want to test \(H_0: \mu \leq 0\) versus \(H_1: \mu > 0\). Hence, \(\Theta_0 = (-\infty, 0]\) and \(\Theta_1 = (0, \infty)\). Consider the test
\[\text{reject } H_0 \text{ if } T>c\]
where \(T = \bar{X}\). The rejection region is
\[R = \left\{(x_1, \ldots, x_n): T(x_1, \ldots, x_n) > c \right\}\]
Let \(Z\) denote the standard Normal random variable. The power function is
\[ \begin{align} \beta(\mu) &= \Pr_\mu (\bar{X} > c) \\ &= \Pr_\mu \left(\frac{\sqrt{n} (\bar{X} - \mu)}{\sigma} > \frac{\sqrt{n} (c - \mu)}{\sigma} \right) \\ &= \Pr_\mu \left(Z > \frac{\sqrt{n} (c - \mu)}{\sigma} \right) \\ &= 1 - \Phi \left( \frac{\sqrt{n} (c - \mu)}{\sigma} \right) \end{align} \]
This function is increasing in \(\mu\):
# illustrate the shape of the power function: increasing in mu
tibble(x = seq(-3, 3, by = 0.01),
       y = pnorm(x),
       h = x > 0) %>%
  ggplot(aes(x, y, color = h)) +
  geom_line(size = 1) +
  geom_vline(xintercept = 0) +
  scale_color_brewer(type = "qual", labels = expression(H[0], H[1])) +
  annotate(geom = "text", x = -0.25, y = .8, label = expression(beta(mu))) +
  labs(title = "Power function for test statistic",
       x = expression(mu),
       y = expression(alpha),
       color = NULL)
Hence
\[\alpha = \text{sup}_{\mu \leq 0} \beta(\mu) = \beta(0) = 1 - \Phi \left( \frac{\sqrt{n} (c)}{\sigma} \right)\]
For a size \(\alpha\) test, we set this equal to \(\alpha\) and solve for \(c\) to get
\[c = \frac{\sigma \Phi^{-1} (1 - \alpha)}{\sqrt{n}}\]
We reject \(H_0\) when
\[\bar{X} > \frac{\sigma \Phi^{-1} (1 - \alpha)}{\sqrt{n}}\]
Equivalently, we reject when
\[\frac{\sqrt{n}(\bar{X} - 0)}{\sigma} > z_\alpha\]
where \(z_\alpha = \Phi^{-1} (1 - \alpha)\).
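As a concrete sketch of this decision rule in R (the sample size, \(\sigma\), \(\alpha\), and true mean below are assumed purely for illustration):
# one-sided test of H0: mu <= 0 versus H1: mu > 0 with known sigma
n <- 30
sigma <- 1
alpha <- 0.05
x <- rnorm(n, mean = 0.5, sd = sigma)   # data generated with true mu = 0.5

test_stat <- sqrt(n) * (mean(x) - 0) / sigma
z_alpha <- qnorm(1 - alpha)
test_stat > z_alpha                      # reject H0 if TRUE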
Ideally we find the test with the highest power under \(H_1\) among all size \(\alpha\) tests. In practice, we rely on a handful of commonly used tests that work well in most settings.
Let \(\theta\) be a scalar parameter, let \(\hat{\theta}\) be an estimate of \(\theta\), and let \(\widehat{\se}\) be the estimated standard error of \(\hat{\theta}\). Consider testing
\[H_0: \theta = \theta_0 \quad \text{versus} \quad H_1: \theta \neq \theta_0\]
Assume that \(\hat{\theta}\) is asymptotically Normal, so that under \(H_0\)
\[\frac{\hat{\theta} - \theta_0}{\widehat{\se}} \leadsto N(0,1)\]
The size \(\alpha\) Wald test is: reject \(H_0\) when \(|W| > z_{\alpha / 2}\) where
\[W = \frac{\hat{\theta} - \theta_0}{\widehat{\se}}\]
Suppose the true value of \(\theta\) is \(\theta_* \neq \theta_0\). The power \(\beta(\theta_*)\) – the probability of correctly rejecting the null hypothesis – is given (approximately) by
\[\beta(\theta_*) \approx 1 - \Phi \left( \frac{\theta_0 - \theta_*}{\widehat{\se}} + z_{\alpha/2} \right) + \Phi \left( \frac{\theta_0 - \theta_*}{\widehat{\se}} - z_{\alpha/2} \right)\]
Remember this is a two-tailed test: we take the probability mass in the center of the standard Normal distribution and subtract it from 1, leaving the area in the two tails of the distribution.
Recall that \(\widehat{\se}\) tends to 0 as the sample size increases. So the power increases toward 1 as the sample size grows, and the power is larger the farther \(\theta_*\) is from \(\theta_0\).
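The sketch below plots this approximation, assuming \(\theta_0 = 0\) and \(\widehat{\se} = 1/\sqrt{n}\) (as for a sample mean with \(\sigma = 1\)); wald_power is a helper written just for this illustration:
# approximate power of the size-alpha Wald test, assuming theta_0 = 0
# and se-hat = 1 / sqrt(n) (e.g. a sample mean with sigma = 1)
wald_power <- function(theta_star, n, theta_0 = 0, alpha = 0.05) {
  se_hat <- 1 / sqrt(n)
  z <- qnorm(1 - alpha / 2)
  1 - pnorm((theta_0 - theta_star) / se_hat + z) +
    pnorm((theta_0 - theta_star) / se_hat - z)
}

crossing(theta_star = seq(-1, 1, by = 0.01),
         n = c(10, 50, 100)) %>%
  mutate(power = wald_power(theta_star, n)) %>%
  ggplot(aes(theta_star, power, color = factor(n))) +
  geom_line() +
  labs(title = "Approximate power of the Wald test",
       x = expression(theta["*"]),
       y = "Power",
       color = "n")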
Let \(X_1, \ldots, X_m\) and \(Y_1, \ldots, Y_n\) be two independent samples from populations with means \(\mu_1, \mu_2\) respectively. Let’s test the null hypothesis that \(\mu_1 = \mu_2\). Write this as
\[H_0: \delta = 0 \quad \text{versus} \quad H_1: \delta \neq 0\]
where \(\delta = \mu_1 - \mu_2\). The estimate of \(\delta\) is \(\hat{\delta} = \bar{X} - \bar{Y}\) with estimated standard error
\[\widehat{\se} = \sqrt{\frac{s_1^2}{m} + \frac{s_2^2}{n}}\]
where \(s_1^2\) and \(s_2^2\) are the sample variances. The size \(\alpha\) Wald test rejects \(H_0\) when \(|W| > z_{\alpha / 2}\) where
\[W = \frac{\hat{\delta} - 0}{\widehat{\se}} = \frac{\bar{X} - \bar{Y}}{\sqrt{\frac{s_1^2}{m} + \frac{s_2^2}{n}}}\]
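A minimal sketch of this two-sample Wald test in R (the group sizes, means, and standard deviation are assumed for illustration):
# two-sample Wald test of H0: mu1 = mu2
m <- 50
n <- 40
x <- rnorm(m, mean = 1, sd = 2)
y <- rnorm(n, mean = 0, sd = 2)

delta_hat <- mean(x) - mean(y)
se_hat <- sqrt(var(x) / m + var(y) / n)
w <- delta_hat / se_hat
abs(w) > qnorm(1 - 0.05 / 2)    # reject H0 at alpha = .05 if TRUE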
An alternative for small samples from a Normal population is the \(t\)-test, which is based on the \(t\)-distribution. The \(t\)-distribution with \(k\) degrees of freedom has density
\[f(t) = \frac{\Gamma \left(\frac{k+1}{2}\right)}{\sqrt{k\pi} \, \Gamma \left(\frac{k}{2}\right)} \left(1 + \frac{t^2}{k}\right)^{-\frac{k + 1}{2}}\]
expand.grid(x = seq(-4, 4, by = 0.01),
df = c(1, 3, 5, 10, 30, Inf)) %>%
as_tibble() %>%
mutate(y = dt(x, df),
df = factor(df, labels = c(1, 3, 5, 10, 30, "Normal"))) %>%
ggplot(aes(x, y, color = df)) +
geom_line() +
scale_color_brewer(type = "qual") +
labs(title = "t-Distribution",
x = expression(X),
y = "PDF",
color = "Degrees of freedom") +
theme(legend.position = "bottom")
When the degrees of freedom \(k \rightarrow \infty\), this tends to a Normal distribution.
To test \(H_0: \mu = \mu_0\) where \(\mu = \E[X_i]\) is the mean, we can use the Wald test. When the data are assumed to be Normal and the sample size is small, it is common to use the \(t\)-test.
There is a relationship between the Wald test and the \(1 - \alpha\) asymptotic confidence interval \(\hat{\theta} \pm \widehat{\se} z_{\alpha/2}\). The size \(\alpha\) Wald test rejects \(H_0: \theta = \theta_0\) versus \(H_1: \theta \neq \theta_0\) if and only if \(\theta_0 \notin C\) where
\[C = (\hat{\theta} - \widehat{\se}z_{\alpha / 2}, \hat{\theta} + \widehat{\se}z_{\alpha / 2})\]
Thus, testing the hypothesis is equivalent to checking whether the null value is in the confidence interval.
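A small sketch of this equivalence in R, using the sample mean as \(\hat{\theta}\) (the data-generating values and \(\theta_0\) are assumed for illustration):
# the Wald test rejects exactly when theta_0 falls outside the CI
x <- rnorm(100, mean = 0.3)
theta_hat <- mean(x)
se_hat <- sd(x) / sqrt(length(x))
theta_0 <- 0
z <- qnorm(1 - 0.05 / 2)

reject_wald <- abs((theta_hat - theta_0) / se_hat) > z
ci <- c(theta_hat - se_hat * z, theta_hat + se_hat * z)
reject_ci <- theta_0 < ci[1] | theta_0 > ci[2]
identical(reject_wald, reject_ci)   # TRUE: the two decisions agree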
Rejecting \(H_0\) indicates the result is statistically significant; that is, we have strong evidence to reject \(H_0\). But if the test is powerful, even a very small effect can be statistically significant. In that situation we have statistical significance but not necessarily scientific/substantive/practical significance. You should always be concerned with both types of significance; statistical significance alone is not necessarily a useful or informative finding.
We can use a more fine-grained measure of the evidence against the null hypothesis. Generally, if a test rejects at level \(\alpha\) it will also reject at any level \(\alpha' > \alpha\). Hence, there is a smallest \(\alpha\) at which the test rejects, and we call this number the \(p\)-value. Informally, the smaller the \(p\)-value, the stronger the evidence against \(H_0\). Remember that the \(p\)-value reflects the power of the test, so both the magnitude of the difference between \(\theta_*\) and \(\theta_0\) and the sample size influence its value.
\(p\)-value | evidence |
---|---|
\(< .01\) | very strong evidence against \(H_0\) |
\(.01 - .05\) | strong evidence against \(H_0\) |
\(.05 - .10\) | weak evidence against \(H_0\) |
\(> .1\) | little or no evidence against \(H_0\) |
Suppose that the size \(\alpha\) test is of the form
\[\text{reject } H_0 \text{ if and only if } T(X^n) \geq c_\alpha\]
Then,
\[\text{p-value} = \text{sup}_{\theta \in \Theta_0} \Pr_\theta (T(X^n) \geq T(x^n))\]
where \(x^n\) is the observed value of \(X^n\). If \(\Theta_0 = \{ \theta_0 \}\) then
\[\text{p-value} = \Pr_{\theta_0} (T(X^n) \geq T(x^n))\]
Informally, the \(p\)-value is the probability, under \(H_0\), of observing a value of the test statistic as extreme as or more extreme than the value actually observed.
Let
\[w = \frac{\hat{\theta} - \theta_0}{\widehat{\se}}\]
denote the observed value of the Wald statistic \(W\). The \(p\)-value is given by
\[\text{p-value} = \Pr_{\theta_0} (|W| > |w|) \approx \Pr (|Z| > |w|) = 2 \Phi(-|w|)\]
where \(Z \sim N(0,1)\).
tibble(x = seq(-3, 3, by = 0.01),
       y = dnorm(x),
       tails = x < qnorm(p = 0.025) | x > qnorm(p = 0.975)) %>%
  ggplot(aes(x, y)) +
  geom_linerange(aes(ymax = y, ymin = 0, color = tails)) +
  scale_color_brewer(type = "qual", labels = expression(NULL, alpha/2)) +
  geom_vline(xintercept = qnorm(p = c(0.025, 0.975)), linetype = 2) +
  annotate(geom = "text", x = qnorm(p = c(0.025, 0.975)) - .25, y = .35,
           label = expression(paste("|", -w, "|"), paste("|", w, "|"))) +
  labs(title = "P-value for the Wald test statistic",
       x = NULL,
       y = NULL,
       color = NULL) +
  theme_void()
Consider a set of 371 individuals in a health study examining cholesterol levels (in mg/dl). 320 individuals have narrowing of the arteries, while 51 patients have no evidence of heart disease. Is the mean cholesterol different in the two groups?
Let the estimated mean cholesterol levels for the first group be \(\bar{X} = 216.2\) and for the second group \(\bar{Y} = 195.3\). Let the estimated standard error for each group be \(\widehat{\se}(\hat{\mu}_1) = 5.0\) and \(\widehat{\se}(\hat{\mu}_2) = 2.4\). The Wald test statistic is
\[W = \frac{\hat{\delta} - 0}{\widehat{\se}} = \frac{\bar{X} - \bar{Y}}{\sqrt{\widehat{\se}_1^2 + \widehat{\se}_2^2}} = \frac{216.2 - 195.3}{\sqrt{5^2 + 2.4^2}} = 3.78\]
To compute the \(p\)-value, let \(Z \sim N(0,1)\) denote a standard Normal random variable. Then
\[\text{p-value} = \Pr (|Z| > 3.78) = 2 \Pr(Z < -3.78) = 0.0002\]
which is very strong evidence against the null hypothesis.
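A quick check of this calculation in R, using the summary statistics reported above:
# Wald test for the difference in mean cholesterol
x_bar <- 216.2
y_bar <- 195.3
se_1 <- 5.0
se_2 <- 2.4

w <- (x_bar - y_bar) / sqrt(se_1^2 + se_2^2)
w                     # approximately 3.78
2 * pnorm(-abs(w))    # two-sided p-value, approximately 0.0002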
Let \(Z_1, \ldots, Z_k\) be independent, standard Normals. Let \(V = \sum_{i=1}^k Z_i^2\). Then we say that \(V\) has a \(\chi^2\) distribution with \(k\) degrees of freedom, written \(V \sim \chi_k^2\). The pdf of \(V\) is
\[f(v) = \frac{v^{\frac{k}{2} - 1} e^{-\frac{v}{2}}}{2^{\frac{k}{2}} \Gamma \left(\frac{k}{2} \right)}\]
for \(v>0\).
df <- c(1, 2, 3, 4, 6, 9)

crossing(x = seq(0.01, 8, by = 0.01), df = df) %>%
  mutate(y = dchisq(x, df),
         df = factor(df)) %>%
  ggplot(aes(x, y, color = df)) +
  geom_line(size = 1, alpha = .8) +
  ylim(0, .5) +
  scale_color_brewer(type = "qual") +
  labs(title = expression(chi^2),
       x = expression(X),
       y = NULL,
       color = expression(k))
It can be shown that \(\E[V] = k\) and \(\Var (V) = 2k\). We define the upper \(\alpha\) quantile \(\chi_{k,\alpha}^2 = F^{-1} (1 - \alpha)\) where \(F\) is the CDF. That is, \(\Pr(\chi_k^2 > \chi_{k,\alpha}^2) = \alpha\).
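For example, the upper \(\alpha\) quantile can be computed directly in R:
# upper alpha = .05 quantile of the chi-squared distribution with k = 3
qchisq(1 - 0.05, df = 3)
# check: the probability of exceeding this value is alpha
pchisq(qchisq(1 - 0.05, df = 3), df = 3, lower.tail = FALSE)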
Pearson’s \(\chi^2\) test is used for multinomial data. Recall that if \(X = (X_1, \ldots, X_k)\) has a multinomial \((n,p)\) distribution, then the MLE of \(p\) is \(\hat{p} = (\hat{p}_1, \ldots, \hat{p}_k) = (x_1 / n, \ldots, x_k / n)\).
Let \(p_0 = (p_{01}, \ldots, p_{0k})\) be some fixed vector and suppose we want to test
\[H_0: p = p_0 \quad \text{versus} \quad H_1: p \neq p_0\]
Pearson’s \(\chi^2\) statistic is
\[T = \sum_{j=1}^k \frac{(X_j - np_{0j})^2}{np_{0j}} = \sum_{j=1}^k \frac{(X_j - E_j)^2}{E_j}\]
where \(E_j = \E[X_j] = np_{0j}\) is the expected count under \(H_0\). Under \(H_0\), \(T \leadsto \chi_{k-1}^2\), and the approximate \(p\)-value is \(\Pr(\chi_{k-1}^2 > t)\), where \(t\) is the observed value of the statistic.
For example, consider a cross-tabulation of survey respondents' ideology and their support for a right to abortion (\(n = 1881\)). Say the null hypothesis is correct: ideology and attitudes toward abortion are unrelated. What would the table look like?
Right to Abortion | Liberal | Moderate | Conservative | Total |
---|---|---|---|---|
Yes | 40.8% (206.45) | 40.8% (289.68) | 40.8% (271.32) | 40.8% (768) |
No | 59.2% (299.55) | 59.2% (420.32) | 59.2% (393.68) | 59.2% (1113) |
Total | 26.9% (506) | 37.7% (710) | 35.4% (665) | 100% (1881) |
In truth, what does our table actually look like?
Right to Abortion | Liberal | Moderate | Conservative | Total |
---|---|---|---|---|
Yes | 62.6% (317) | 36.6% (260) | 28.7% (191) | 40.8% (768) |
No | 37.4% (189) | 63.4% (450) | 71.3% (474) | 59.2% (1113) |
Total | 26.9% (506) | 37.7% (710) | 35.4% (665) | 100% (1881) |
How can we test if these differences are statistically significant? That is, how do we test to see if we can reject the null hypothesis?
Right to Abortion | | Liberal | Moderate | Conservative |
---|---|---|---|---|
Yes | Observed Frequency (\(X_j\)) | 317.0 | 260.0 | 191.0 |
 | Expected Frequency (\(E_j\)) | 206.6 | 289.9 | 271.5 |
 | \(X_j - E_j\) | 110.4 | -29.9 | -80.5 |
 | \((X_j - E_j)^2\) | 12188.9 | 893.3 | 6482.7 |
 | \(\frac{(X_j - E_j)^2}{E_j}\) | 59.0 | 3.1 | 23.9 |
No | Observed Frequency (\(X_j\)) | 189.0 | 450.0 | 474.0 |
 | Expected Frequency (\(E_j\)) | 299.4 | 420.1 | 393.5 |
 | \(X_j - E_j\) | -110.4 | 29.9 | 80.5 |
 | \((X_j - E_j)^2\) | 12188.9 | 893.3 | 6482.7 |
 | \(\frac{(X_j - E_j)^2}{E_j}\) | 40.7 | 2.1 | 16.5 |
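Summing the last row of each panel across cells gives the test statistic; we can also let R do the work. A sketch using the observed counts from the table (chisq.test treats this as a test of independence between the two variables, with \((2-1)(3-1) = 2\) degrees of freedom):
# observed counts: rows = (Yes, No), columns = (Liberal, Moderate, Conservative)
observed <- matrix(c(317, 189,
                     260, 450,
                     191, 474),
                   nrow = 2)
chisq.test(observed)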
The Wald test is useful for testing a scalar parameter. The likelihood ratio test is more general and can be used for testing a vector-valued parameter.
Consider testing
\[H_0: \theta \in \Theta_0 \quad \text{versus} \quad H_1: \theta \notin \Theta_0\]
The likelihood ratio statistic is
\[\lambda = 2 \log \left( \frac{\text{sup}_{\theta \in \Theta} \Lagr (\theta)}{\text{sup}_{\theta \in \Theta_0} \Lagr (\theta)} \right) = 2 \log \left( \frac{\Lagr(\hat{\theta})}{\Lagr (\hat{\theta}_0)} \right)\]
where \(\hat{\theta}\) is the MLE and \(\hat{\theta}_0\) is the MLE when \(\theta\) is restricted to lie in \(\Theta_0\).
The likelihood ratio test is most useful when \(\Theta_0\) consists of all parameter values \(\theta\) such that some coordinates of \(\theta\) are fixed at particular values.
Suppose that \(\theta = (\theta_1, \ldots, \theta_q, \theta_{q + 1}, \ldots, \theta_r)\). Let
\[\Theta_0 = \left\{ \theta: (\theta_{q+ 1}, \ldots, \theta_r) = (\theta_{0,q+1}, \ldots, \theta_{0,r}) \right\}\]
Let \(\lambda\) be the likelihood ratio test statistic. Under \(H_0: \theta \in \Theta_0\),
\[\lambda(x^n) \leadsto \chi_{r - q}^2\]
where \(r-q\) is the dimension of \(\Theta\) minus the dimension of \(\Theta_0\). The \(p\)-value for the test is \(\Pr (\chi_{r-q}^2 > \lambda)\).
For example, if \(\theta = (\theta_1, \theta_2, \theta_3, \theta_4, \theta_5)\) and we want to test the null hypothesis that \(\theta_4 = \theta_5 = 0\) then the limiting distribution has \(5-3 = 2\) degrees of freedom.
Mendel bred peas with round yellow seeds and wrinkled green seeds. There are four types of progeny: round yellow, wrinkled yellow, round green, and wrinkled green. The number of each type is multinomial with probability \(p = (p_1, p_2, p_3, p_4)\). His theory of inheritance predicts that \(p\) is equal to
\[p_0 \equiv \left(\frac{9}{16}, \frac{3}{16}, \frac{3}{16}, \frac{1}{16} \right)\]
In \(n = 556\) trials he observed \(X = (315,101,108,32)\). The likelihood ratio test for \(H_0: p = p_0 \quad \text{versus} \quad H_1: p \neq p_0\) is
\[ \begin{align} \lambda &= 2 \log \left( \frac{\Lagr (\hat{p})}{\Lagr(p_0)} \right) \\ &= 2 \sum_{j=1}^4 X_j \log \left( \frac{\hat{p}_j}{p_{0j}} \right) \\ &= 2 \left[ 315 \log \left( \frac{\dfrac{315}{556}}{\dfrac{9}{16}} \right) + 101 \log \left( \frac{\dfrac{101}{556}}{\dfrac{3}{16}} \right) + 108 \log \left( \frac{\dfrac{108}{556}}{\dfrac{3}{16}} \right) + 32 \log \left( \frac{\dfrac{32}{556}}{\dfrac{1}{16}} \right)\right] \\ &= 0.48 \end{align} \]
Under \(H_1\) there are four parameters. However, the parameters must sum to one, so the dimension of the parameter space is three. Under \(H_0\) there are no free parameters, so the dimension of the restricted parameter space is zero. The difference of these dimensions is three. Therefore the limiting distribution of \(\lambda\) under \(H_0\) is \(\chi_3^2\) and the p-value is
\[\text{p-value} = \Pr (\chi_3^2 > 0.48) = 0.923\]
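A quick check of this calculation in R:
# likelihood ratio test for Mendel's pea data
X <- c(315, 101, 108, 32)
p_0 <- c(9, 3, 3, 1) / 16
p_hat <- X / sum(X)

lambda <- 2 * sum(X * log(p_hat / p_0))
lambda                                      # approximately 0.48
pchisq(lambda, df = 3, lower.tail = FALSE)  # p-value, approximately 0.92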
A similar result is achieved using the \(\chi^2\) test. When both tests are applicable, they will lead to similar results as long as the sample size is large.
Consider a sample of 100 observations of a continuous outcome of interest \(Y\) measured along with 10 continuous covariates \(X_1, \ldots, X_{10}\). In truth, none of the variables is actually predictive of \(Y\) in the population; all the covariates \(\mathbf{X}\) and the outcome \(Y\) are drawn independently from a standard Normal distribution \(N(0,1)\). If we use ordinary least squares regression and focus on just a single variable as a predictor, a test of significance will yield \(p < .05\) in approximately 5% of samples. Below I simulate this process 1000 times, and in each simulation estimate a single regression model between \(Y\) and a randomly selected \(X_j\):
n_obs <- 100

# regress y on a single randomly chosen covariate and keep the slope estimate
pval_dist <- function(n_obs){
  x <- replicate(10, rnorm(n_obs))
  y <- rnorm(n_obs)
  mod <- lm(y ~ x[, sample(1:10, 1)])
  return(tidy(mod)[2, ])
}
pvals <- 1000 %>%
rerun(pval_dist(n_obs)) %>%
bind_rows %>%
as_tibble %>%
mutate(sig = p.value < .05)
ggplot(pvals, aes(p.value, fill = sig)) +
geom_histogram(binwidth = .025, boundary = 0) +
labs(title = "Distribution of p-values when null is true",
x = expression(P),
y = NULL) +
theme(legend.position = "none")
The distribution of the p-values is approximately uniform and on average 5% of the p-values are \(< .05\). In this situation, the p-value and our inferences drawn from the p-value are as we would expect because we conducted exactly one null hypothesis test against the sample of data.
What happens instead if we evaluate multiple variables during each test? That is, in each iteration of our simulation we regress all 10 predictors individually against \(Y\). What is the chance that we will find \(p < .05\) for at least one of the predictors?
pval_dist_mult <- function(n_obs){
# generate simulated data
x <- replicate(10, rnorm(n_obs))
y <- rnorm(n_obs)
# estimate a linear model for each column in x and find min pvalue
x %>%
as_tibble %>%
mutate(y = y) %>%
gather(i, x, -y) %>%
group_by(i) %>%
nest() %>%
mutate(mod = map(data, ~ lm(y ~ x, data = .x)),
results = map(mod, tidy)) %>%
unnest(results) %>%
filter(term == "x") %>%
filter(p.value == min(p.value))
}
pvals_mult <- 1000 %>%
rerun(pval_dist_mult(n_obs)) %>%
bind_rows %>%
as_tibble %>%
mutate(sig = p.value < .05)
ggplot(pvals_mult, aes(p.value, fill = sig)) +
geom_histogram(binwidth = .025, boundary = 0) +
labs(title = "Distribution of minimum p-values for 10 tests when null is true",
x = expression(P),
y = NULL) +
theme(legend.position = "none")
Now we have a 40% chance of finding a predictor with \(p < .05\), and the distribution of the minimum p-values is not uniform. When we search for the most significant result, we do not have a fixed null hypothesis. Instead, we are conducting 10 null hypothesis tests using the same sample of data. The distribution of the minimum of 10 random uniform distributions has a density \(k(1 - x)^{k-1}\) for \(k\) independent tests. When \(k=10\), the probability of observing \(p < .05\) is \(1 - (1 - 0.05)^{10} = 0.40\).
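We can check the analytic probability against the simulation directly:
# analytic probability that at least one of 10 independent tests has p < .05
1 - (1 - 0.05)^10
# simulated proportion of iterations where the minimum p-value is below .05
mean(pvals_mult$p.value < .05)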
By failing to correct for the fact that we conducted multiple hypothesis tests on the same sample of data, we risk false discoveries; this is a form of selection bias. Even if the tests were not actually performed, we still risk selection bias when any choice of results is based on the outcome rather than on the prespecified hypotheses.
This happens all the time in social science. All scholars and researchers do this: we form a theory, generate hypotheses, collect data to test the hypotheses, and explore multiple model formulations until we settle on the final form. This doesn’t mean we are attempting to commit fraud; we are just using our knowledge to try to estimate a “good” model.
Consider \(m\) hypothesis tests:
\[H_{0i} \quad \text{versus} \quad H_{1i}, i = 1, \ldots, m\]
and let \(P_1, \ldots, P_m\) denote the \(m\) \(p\)-values for these tests. The Bonferroni method approaches the problem as follows: given \(p\)-values \(P_1, \ldots, P_m\), reject null hypothesis \(H_{0i}\) if
\[P_i \leq \frac{\alpha}{m}\]
For example, if a study tested \(m=20\) hypotheses with a desired \(\alpha = 0.05\) (the standard threshold), then the Bonferroni correction would test each individual hypothesis at \(\alpha = \frac{0.05}{20} = 0.0025\).
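In R this adjustment is available through p.adjust(); a small sketch with made-up \(p\)-values:
# Bonferroni adjustment multiplies each p-value by the number of tests (capped at 1);
# rejecting when the adjusted value is below alpha is equivalent to the rule above
p_raw <- c(0.001, 0.004, 0.03, 0.2, 0.6)
p.adjust(p_raw, method = "bonferroni")
p_raw <= 0.05 / length(p_raw)   # reject H0i if TRUE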
For example, draw 100 observations from a normal distribution \(N(0,1)\) and test the null hypothesis \(H_0: \mu = 0\) using a t-test.1 If we simulate this process multiple times, we should reject \(H_0\) approximately 5% of the time.
sim_norm_null <- 1000 %>%
rerun(rnorm(n_obs)) %>%
map(~ t.test(x = .x, mu = 0)) %>%
map_dbl(~ .x$p.value) %>%
as_tibble %>%
mutate(sig = value < .05)
mean(sim_norm_null$value)
## [1] 0.498
ggplot(sim_norm_null, aes(value, fill = sig)) +
geom_histogram(binwidth = .025, boundary = 0) +
labs(title = "Distribution of p-values for single test",
x = expression(P),
y = NULL) +
theme(legend.position = "none")
Now let’s simulate 5 random variables and separately test the null hypothesis that each mean is 0. The probability of at least one significant result is then \(1 - (1 - 0.05)^5 = 0.226\):
sim_norm_mult <- 1000 %>%
rerun(5 %>%
rerun(rnorm(n_obs)) %>%
map(~ t.test(x = .x, mu = 0)) %>%
map_dbl(~ .x$p.value) %>%
as_tibble %>%
mutate(sig = value < .05)) %>%
bind_rows(.id = "sim") %>%
group_by(sim) %>%
rename(raw = value) %>%
mutate(correct = raw < (.05 / n()))
sim_norm_mult %>%
summarize(sig = any(raw < .05)) %>%
ungroup %>%
summarize(mean(sig))
## # A tibble: 1 x 1
## `mean(sig)`
## <dbl>
## 1 0.239
sim_norm_mult %>%
filter(raw == min(raw)) %>%
ggplot(aes(raw, fill = sig)) +
geom_histogram(binwidth = .01, boundary = 0) +
labs(title = "Distribution of p-values for multiple tests",
x = expression(P),
y = NULL) +
theme(legend.position = "none")
But if we use the Bonferroni correction:
sim_norm_mult %>%
summarize(sig = any(correct)) %>%
ungroup %>%
summarize(mean(sig))
## # A tibble: 1 x 1
## `mean(sig)`
## <dbl>
## 1 0.052
sim_norm_mult %>%
filter(raw == min(raw)) %>%
ggplot(aes(raw, fill = correct)) +
geom_histogram(binwidth = .01, boundary = 0) +
labs(title = "Distribution of p-values for multiple tests",
subtitle = "With Bonferroni correction",
x = expression(P),
y = NULL) +
theme(legend.position = "none")
The Bonferroni correction is a conservative adjustment, and errs on the side of caution. You minimize the risk of a false positive, but therefore increase the risk of a false negative. Alternative correction methods such as the Benjamini–Hochberg procedure are less conservative.
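For comparison, the Benjamini-Hochberg adjustment is also available through p.adjust() (a sketch with the same made-up \(p\)-values as above):
# Benjamini-Hochberg controls the false discovery rate rather than the
# family-wise error rate, so it rejects more readily than Bonferroni
p_raw <- c(0.001, 0.004, 0.03, 0.2, 0.6)
p.adjust(p_raw, method = "BH")
p.adjust(p_raw, method = "bonferroni")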
devtools::session_info()
## Session info -------------------------------------------------------------
## setting value
## version R version 3.5.1 (2018-07-02)
## system x86_64, darwin15.6.0
## ui X11
## language (EN)
## collate en_US.UTF-8
## tz America/Chicago
## date 2018-11-20
## Packages -----------------------------------------------------------------
## package * version date source
## assertthat 0.2.0 2017-04-11 CRAN (R 3.5.0)
## backports 1.1.2 2017-12-13 CRAN (R 3.5.0)
## base * 3.5.1 2018-07-05 local
## bindr 0.1.1 2018-03-13 CRAN (R 3.5.0)
## bindrcpp 0.2.2 2018-03-29 CRAN (R 3.5.0)
## broom * 0.5.0 2018-07-17 CRAN (R 3.5.0)
## cellranger 1.1.0 2016-07-27 CRAN (R 3.5.0)
## cli 1.0.0 2017-11-05 CRAN (R 3.5.0)
## colorspace 1.3-2 2016-12-14 CRAN (R 3.5.0)
## compiler 3.5.1 2018-07-05 local
## crayon 1.3.4 2017-09-16 CRAN (R 3.5.0)
## datasets * 3.5.1 2018-07-05 local
## devtools 1.13.6 2018-06-27 CRAN (R 3.5.0)
## digest 0.6.18 2018-10-10 cran (@0.6.18)
## dplyr * 0.7.8 2018-11-10 cran (@0.7.8)
## evaluate 0.11 2018-07-17 CRAN (R 3.5.0)
## forcats * 0.3.0 2018-02-19 CRAN (R 3.5.0)
## ggplot2 * 3.1.0 2018-10-25 cran (@3.1.0)
## glue 1.3.0 2018-07-17 CRAN (R 3.5.0)
## graphics * 3.5.1 2018-07-05 local
## grDevices * 3.5.1 2018-07-05 local
## grid 3.5.1 2018-07-05 local
## gtable 0.2.0 2016-02-26 CRAN (R 3.5.0)
## haven 1.1.2 2018-06-27 CRAN (R 3.5.0)
## hms 0.4.2 2018-03-10 CRAN (R 3.5.0)
## htmltools 0.3.6 2017-04-28 CRAN (R 3.5.0)
## httr 1.3.1 2017-08-20 CRAN (R 3.5.0)
## jsonlite 1.5 2017-06-01 CRAN (R 3.5.0)
## knitr 1.20 2018-02-20 CRAN (R 3.5.0)
## lattice 0.20-35 2017-03-25 CRAN (R 3.5.1)
## lazyeval 0.2.1 2017-10-29 CRAN (R 3.5.0)
## lubridate 1.7.4 2018-04-11 CRAN (R 3.5.0)
## magrittr 1.5 2014-11-22 CRAN (R 3.5.0)
## memoise 1.1.0 2017-04-21 CRAN (R 3.5.0)
## methods * 3.5.1 2018-07-05 local
## modelr 0.1.2 2018-05-11 CRAN (R 3.5.0)
## munsell 0.5.0 2018-06-12 CRAN (R 3.5.0)
## nlme 3.1-137 2018-04-07 CRAN (R 3.5.1)
## patchwork * 0.0.1 2018-09-06 Github (thomasp85/patchwork@7fb35b1)
## pillar 1.3.0 2018-07-14 CRAN (R 3.5.0)
## pkgconfig 2.0.2 2018-08-16 CRAN (R 3.5.1)
## plyr 1.8.4 2016-06-08 CRAN (R 3.5.0)
## purrr * 0.2.5 2018-05-29 CRAN (R 3.5.0)
## R6 2.3.0 2018-10-04 cran (@2.3.0)
## Rcpp 1.0.0 2018-11-07 cran (@1.0.0)
## readr * 1.1.1 2017-05-16 CRAN (R 3.5.0)
## readxl 1.1.0 2018-04-20 CRAN (R 3.5.0)
## rlang 0.3.0.1 2018-10-25 CRAN (R 3.5.0)
## rmarkdown 1.10 2018-06-11 CRAN (R 3.5.0)
## rprojroot 1.3-2 2018-01-03 CRAN (R 3.5.0)
## rstudioapi 0.7 2017-09-07 CRAN (R 3.5.0)
## rvest 0.3.2 2016-06-17 CRAN (R 3.5.0)
## scales 1.0.0 2018-08-09 CRAN (R 3.5.0)
## stats * 3.5.1 2018-07-05 local
## stringi 1.2.4 2018-07-20 CRAN (R 3.5.0)
## stringr * 1.3.1 2018-05-10 CRAN (R 3.5.0)
## tibble * 1.4.2 2018-01-22 CRAN (R 3.5.0)
## tidyr * 0.8.1 2018-05-18 CRAN (R 3.5.0)
## tidyselect 0.2.5 2018-10-11 cran (@0.2.5)
## tidyverse * 1.2.1 2017-11-14 CRAN (R 3.5.0)
## tools 3.5.1 2018-07-05 local
## utils * 3.5.1 2018-07-05 local
## withr 2.1.2 2018-03-15 CRAN (R 3.5.0)
## xml2 1.2.0 2018-01-24 CRAN (R 3.5.0)
## yaml 2.2.0 2018-07-25 CRAN (R 3.5.0)
Example drawn from this StackOverflow question.
This work is licensed under the CC BY-NC 4.0 Creative Commons License.