The question looks simple, but your reflections on it show that it is not.
Actually, p-values are a relatively late addition to the theory of statistics. Computing a p-value without a computer is very tedious, which is why, until recently, the only way to perform a statistical test was to use statistical tables, as I explain in this blog post. Because those tables were computed for fixed $\alpha$ levels (typically 0.05, 0.01 and 0.001), you could only perform a test at those levels.
Computers made those tables useless, but the logic of testing is still the same. You should:
- Formulate a null hypothesis.
- Formulate an alternative hypothesis.
- Decide the maximum type I error (the probability of falsely rejecting the null hypothesis) you are ready to accept.
- Design a rejection region. The probability that the test statistic falls in the rejection region, given that the null hypothesis is true, is your level $\alpha$. As @MånsT explains, this should be no larger than your acceptable type I error, and in many cases one uses asymptotic approximations.
- Carry out the random experiment, compute the test statistic and see whether it falls in the rejection region.
In theory, the events "the statistic falls in the rejection region" and "the p-value is less than $\alpha$" are strictly equivalent, which is why it is felt that you can report the p-value instead. In practice, it allows you to skip the design of the rejection region and to evaluate the type I error after the test is done.
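The steps above can be sketched in R for a coin-flipping experiment; the sample size, the level and the seed below are arbitrary illustrative choices, not anything prescribed by the recipe itself:

```r
# A sketch of the testing recipe above for a fair-coin experiment.
# n, alpha and the seed are arbitrary illustrative choices.
set.seed(1)
n <- 1000
alpha <- 0.05

# Steps 1-2: H0: prob(head) = 1/2 versus H1: prob(head) != 1/2.
# Steps 3-4: choose a symmetric rejection region of level at most alpha.
lo <- qbinom(alpha / 2, size = n, prob = 0.5) - 1
hi <- qbinom(1 - alpha / 2, size = n, prob = 0.5) + 1

# Step 5: run the experiment and check the two criteria.
heads <- rbinom(1, size = n, prob = 0.5)
in_region <- (heads <= lo) | (heads >= hi)
p_value <- binom.test(heads, n, p = 0.5)$p.value

# Up to the discreteness of the binomial, the two decisions agree.
c(in_region, p_value < alpha)
```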
To come back to your post, the statement of the null hypothesis is incorrect. The null hypothesis is that the probability of flipping a head is $1/2$ (the null hypothesis cannot pertain to the results of the random experiment).
If you repeat the experiment again and again with a threshold p-value of 0.05, yes, you should have approximately 5% rejection. And if you set a p-value cut-off of 0.06, you should end up with roughly 6% rejection. More generally, for continuous tests, by definition of the p-value $p$
$$ \text{Prob}(p < x) = x, \quad 0 < x < 1, $$
which is only approximately true for discrete tests.
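For a continuous test statistic the uniformity is exact. A quick sketch with a two-sided $z$-test (known variance, so the statistic is continuous; the sample size and number of replicates are arbitrary choices) illustrates it:

```r
set.seed(1)
# p-values of a two-sided z-test on N(0, 1) samples, with H0 true.
p_unif <- replicate(10000, {
  x <- rnorm(30)                        # H0 true: mean 0, known sd 1
  2 * pnorm(-abs(sqrt(30) * mean(x)))   # two-sided p-value
})
# The proportion below any cut-off x should be close to x itself.
mean(p_unif < 0.05)
```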
Here is some R code that I hope clarifies this a bit. The binomial test is relatively slow, so I run only 10,000 random experiments, in each of which I flip a coin 1,000 times. I perform a binomial test on each and collect the 10,000 p-values.
set.seed(123)
# Generate 10,000 random experiments of 1,000 coin flips each.
rexperiments <- rbinom(n = 10000, size = 1000, prob = 0.5)
all_p_values <- rep(NA, 10000)
for (i in 1:10000) {
  all_p_values[i] <- binom.test(rexperiments[i], 1000)$p.value
}
# Plot the empirical cumulative distribution of the p-values.
plot(ecdf(all_p_values))
# How many are less than 0.05?
mean(all_p_values < 0.05)
# [1] 0.0425
# How many are less than 0.06?
mean(all_p_values < 0.06)
# [1] 0.0491
You can see that the proportions are not exact, because the sample size is not infinite and the test statistic is discrete, but there is still an increase of roughly one percentage point between the two.
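The discreteness is easy to see directly: because the head count is an integer, the binomial test can only produce a finite set of p-values, which jump in steps around any cut-off. The window of head counts below is an arbitrary illustrative choice:

```r
# Distinct p-values attainable by binom.test for 1,000 flips, for head
# counts in an illustrative window around the rejection threshold.
p_at <- sapply(460:480, function(k) binom.test(k, 1000)$p.value)
# The attainable values jump in discrete steps around 0.05.
sort(unique(round(p_at, 4)))
```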
A statistical model is given by a family of probability distributions. When the model is parametric, this family is indexed by an unknown parameter $\theta$:
$$\mathcal{F}=\left\{ f(\cdot|\theta);\ \theta\in\Theta \right\}$$
If one wants to test a hypothesis on $\theta$, like $H_0:\,\theta\in\Theta_0$, one can consider two models in opposition: $\mathcal{F}$ versus
$$\mathcal{F}_0=\left\{ f(\cdot|\theta);\ \theta\in\Theta_0 \right\}$$
From my Bayesian perspective, I am drawing inference on the index of the model behind the data, $\mathcal{M}$. Hence I put a prior on this index, with weights $\rho_0$ and $\rho_a=1-\rho_0$, as well as priors on the parameters of both models, $\pi_0(\theta)$ over $\Theta_0$ and $\pi_a(\theta)$ over $\Theta$. I then deduce the posterior distribution of this index:
$$\pi(m=0|x)=\dfrac{\rho_0\int_{\Theta_0} f(x|\theta)\pi_0(\theta)\text{d}\theta}{\rho_0\int_{\Theta_0} f(x|\theta)\pi_0(\theta)\text{d}\theta
+(1-\rho_0)\int_{\Theta} f(x|\theta)\pi_a(\theta)\text{d}\theta}$$
The document you linked to goes into much more detail on this perspective and should be your entry point of choice into statistical testing of hypotheses, unless you can afford to go through a whole Bayesian book, or even a machine learning book like Kevin Murphy's.
For instance, in the setting where $X\sim\mathcal{N}(\theta,1)$ is observed, if the hypothesis to be tested is $H_0:\theta=0$, the posterior probability that $\theta=0$ is the posterior probability that the model producing the data is $\mathcal{N}(0,1)$. According to the above formula, if the prior distribution on $\theta$ is $\theta\sim\mathcal{N}(0,10)$, and if we put equal weights on both hypotheses, i.e., $\rho_0=1/2$, this posterior probability is
\begin{align*}\pi(m=0|x)&=\dfrac{\frac{1}{\sqrt{2\pi}}\exp\{-x^2/2\}}{\frac{1}{\sqrt{2\pi}}\exp\{-x^2/2\}
+\int_{\mathbb{R}} \frac{1}{\sqrt{2\pi}}\exp\{-(x-\theta)^2/2\}\frac{1}{\sqrt{2\pi\times10}}\exp\{-\theta^2/20\}\text{d}\theta}\\
&=\dfrac{\exp\{-x^2/2\}}{\exp\{-x^2/2\}
+\frac{1}{\sqrt{11}}\exp\{-x^2/22\}}
\end{align*}
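As a sanity check on the closed form, the posterior probability can also be computed by numerical integration; the observed value $x = 1.5$ below is an arbitrary choice, and `integrate` approximates the marginal likelihood under the alternative:

```r
# Numerical check of the closed-form posterior above.
# x = 1.5 is an arbitrary observed value; rho_0 = 1/2 as in the text.
x <- 1.5

# Marginal likelihood under H0: theta = 0.
m0 <- dnorm(x, mean = 0, sd = 1)

# Marginal likelihood under the alternative: integrate the N(theta, 1)
# likelihood against the N(0, 10) prior (variance 10).
ma <- integrate(function(theta) dnorm(x, theta, 1) * dnorm(theta, 0, sqrt(10)),
                lower = -Inf, upper = Inf)$value

posterior_numeric <- m0 / (m0 + ma)

# Closed form from the derivation above.
posterior_closed <- exp(-x^2 / 2) / (exp(-x^2 / 2) + exp(-x^2 / 22) / sqrt(11))

all.equal(posterior_numeric, posterior_closed)
```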
Answer to question 1: This occurs because, in frequentist tests for difference (i.e. tests with a null hypothesis of no difference, or some form of equality), the $p$-value becomes arbitrarily small as the sample size increases whenever a true difference exactly equal to zero, as opposed to arbitrarily close to zero, is not realistic (see Nick Stauner's comment to the OP). The $p$-value shrinks because the error of frequentist test statistics generally decreases with sample size, with the upshot that, given a large enough sample, all differences are significant at any level. Cosma Shalizi has written eruditely about this.
Answer to question 2: Within a frequentist hypothesis testing framework, one can guard against this by not making inference solely about detecting a difference. For example, one can combine inferences about difference and equivalence, so that one does not place the burden of proof solely on evidence of effect while neglecting evidence of absence of effect. Evidence of absence of an effect comes from, for example, tests for equivalence such as the two one-sided tests (TOST) procedure.
What these approaches all share is an a priori decision about what effect size constitutes a relevant difference and a null hypothesis framed in terms of a difference at least as large as what is considered relevant.
Combined inference from tests for difference and tests for equivalence thus protects against the bias you describe when sample sizes are large. The four possibilities resulting from combined tests for difference (positivist null hypothesis, $\text{H}_{0}^{+}$) and equivalence (negativist null hypothesis, $\text{H}_{0}^{-}$) are:

| | Reject $\text{H}_{0}^{-}$ | Fail to reject $\text{H}_{0}^{-}$ |
|---|---|---|
| **Reject $\text{H}_{0}^{+}$** | Trivial difference (overpowered test) | Relevant difference |
| **Fail to reject $\text{H}_{0}^{+}$** | Equivalence | Indeterminate (underpowered test) |
Notice the upper-left quadrant: an overpowered test is one where you reject the null hypothesis of no difference, but you also reject the null hypothesis of a relevant difference. So yes, there is a difference, but you have a priori decided you do not care about it, because it is too small.
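A minimal sketch of such combined inference in R, using the two one-sided tests (TOST) procedure for the equivalence side; the equivalence margin `delta`, the sample size, and the true effect below are all assumed values chosen for illustration:

```r
# Combined difference + equivalence testing (TOST sketch).
# delta = 0.2 is an assumed "smallest relevant difference";
# n = 5000 with a small true effect of 0.1 illustrates the overpowered case.
set.seed(42)
x <- rnorm(5000, mean = 0.1)
delta <- 0.2
alpha <- 0.05

# Test for difference: H0+ : mean = 0.
p_diff <- t.test(x, mu = 0)$p.value

# Test for equivalence: H0- : |mean| >= delta, rejected when BOTH
# one-sided tests reject.
p_lower <- t.test(x, mu = -delta, alternative = "greater")$p.value
p_upper <- t.test(x, mu =  delta, alternative = "less")$p.value
p_equiv <- max(p_lower, p_upper)

# Here both tests reject: the difference is "significant" thanks to the
# large sample, yet also demonstrably smaller than the relevant margin.
c(difference = p_diff < alpha, equivalence = p_equiv < alpha)
```

This is the upper-left quadrant of the table above: a real but trivially small difference, detected only because the sample is large.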
Answer to question 3: See answer to 2.