Why Frequentist Hypothesis Testing Becomes Biased Towards Rejecting the Null with Large Samples

Tags: frequentist, hypothesis-testing

I was just reading this article on the Bayes factor for a completely unrelated problem when I stumbled upon this passage:

Hypothesis testing with Bayes factors is more robust than frequentist hypothesis testing, since the Bayesian form avoids model selection bias, evaluates evidence in favor of the null hypothesis, includes model uncertainty, and allows non-nested models to be compared (though of course the models must have the same dependent variable). Also, frequentist significance tests become biased in favor of rejecting the null hypothesis with sufficiently large sample size. [emphasis added]

I've seen this claim before in Karl Friston's 2012 paper in NeuroImage, where he calls it the fallacy of classical inference.

I've had a bit of trouble finding a truly pedagogical account of why this should be true. Specifically, I'm wondering:

  1. why this occurs
  2. how to guard against it
  3. failing that, how to detect it

Best Answer

Answer to question 1: This occurs because the $p$-value becomes arbitrarily small as the sample size increases in frequentist tests for difference (i.e. tests with a null hypothesis of no difference, or some other form of equality) whenever a true difference of exactly zero, as opposed to one arbitrarily close to zero, is not realistic (see Nick Stauner's comment to the OP). The $p$-value becomes arbitrarily small because the standard error of frequentist test statistics generally decreases with sample size, with the upshot that, given a large enough sample, every difference is significant at any desired level. Cosma Shalizi has written eruditely about this.
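A minimal simulation sketch of this point, assuming SciPy is available: the true difference (0.01 here) is illustrative and deliberately tiny, yet the one-sample $t$-test's $p$-value collapses toward zero as $n$ grows.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
true_diff = 0.01  # a tiny, practically irrelevant departure from the null mean of 0

for n in (100, 10_000, 1_000_000, 10_000_000):
    x = rng.normal(loc=true_diff, scale=1.0, size=n)
    result = stats.ttest_1samp(x, popmean=0.0)  # H0: mean == 0
    print(f"n = {n:>10,}   p = {result.pvalue:.3g}")
```

With the standard error shrinking like $1/\sqrt{n}$, the test statistic for this fixed, tiny effect grows without bound, so "significance" is eventually guaranteed by sample size alone.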

Answer to question 2: Within a frequentist hypothesis-testing framework, one can guard against this by not making inference solely about detecting a difference. For example, one can combine inferences about difference and equivalence, so that the burden of proof is not placed entirely on evidence of an effect as opposed to evidence of the absence of an effect (nor are the two conflated). Evidence of the absence of an effect comes from, for example:

  1. two one-sided tests for equivalence (TOST),
  2. uniformly most powerful tests for equivalence, and
  3. the confidence interval approach to equivalence (i.e. if the $100(1-2\alpha)\%$ CI of the test statistic lies entirely within the a priori-defined range of equivalence/relevance, then one concludes equivalence at the $\alpha$ level of significance).

What these approaches all share is an a priori decision about what effect size constitutes a relevant difference and a null hypothesis framed in terms of a difference at least as large as what is considered relevant.
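To see the first of these approaches in action, here is a hedged one-sample TOST sketch using only SciPy; the equivalence margin `delta` and the `alpha` level are illustrative assumptions of this example, not values from the answer.

```python
import numpy as np
from scipy import stats

def tost_one_sample(x, delta, alpha=0.05):
    """One-sample TOST: reject H0-: |mean| >= delta if both one-sided p-values < alpha."""
    # Test 1: H0: mean <= -delta  vs  H1: mean > -delta
    p_lower = stats.ttest_1samp(x, popmean=-delta, alternative="greater").pvalue
    # Test 2: H0: mean >= +delta  vs  H1: mean < +delta
    p_upper = stats.ttest_1samp(x, popmean=+delta, alternative="less").pvalue
    p_tost = max(p_lower, p_upper)  # the TOST (equivalence) p-value
    return p_tost, p_tost < alpha

rng = np.random.default_rng(1)
x = rng.normal(loc=0.01, scale=1.0, size=100_000)  # tiny true effect, large sample
p, equivalent = tost_one_sample(x, delta=0.1)      # delta: a priori relevance margin
print(f"TOST p-value = {p:.3g}; conclude equivalence: {equivalent}")
```

Note that the null hypothesis here is framed exactly as described above: a difference at least as large as the a priori relevance margin `delta`.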

Combined inference from tests for difference and tests for equivalence thus protects against the bias you describe when sample sizes are large. The two-by-two table below shows the four possibilities that result from combining a test for difference (positivist null hypothesis, $\text{H}_{0}^{+}$) with a test for equivalence (negativist null hypothesis, $\text{H}_{0}^{-}$):

[Figure: Four possibilities from combined tests for difference and tests for equivalence]

Notice the upper-left quadrant: an over-powered test is one in which you reject the null hypothesis of no difference, but you also reject the null hypothesis of a relevant difference; so yes, there is a difference, but you decided a priori that you do not care about it because it is too small.
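To make the combined decision concrete, here is a sketch that pairs an ordinary difference test with the TOST above. Only the over-powered cell's interpretation is spelled out explicitly in this answer, so the labels for the other three cells are my own shorthand, and `delta`/`alpha` remain illustrative assumptions.

```python
import numpy as np
from scipy import stats

def combined_decision(x, delta, alpha=0.05):
    # Test for difference: positivist null H0+ (true difference is zero)
    p_diff = stats.ttest_1samp(x, popmean=0.0).pvalue
    # Test for equivalence (TOST): negativist null H0- (|true difference| >= delta)
    p_equiv = max(
        stats.ttest_1samp(x, popmean=-delta, alternative="greater").pvalue,
        stats.ttest_1samp(x, popmean=+delta, alternative="less").pvalue,
    )
    reject_diff = p_diff < alpha    # reject H0+: some difference detected
    reject_equiv = p_equiv < alpha  # reject H0-: difference is smaller than delta

    if reject_diff and reject_equiv:
        return "difference detected, but smaller than the relevance margin (over-powered test)"
    if reject_diff:
        return "relevant difference"
    if reject_equiv:
        return "equivalence"
    return "indeterminate (under-powered test)"

rng = np.random.default_rng(2)
print(combined_decision(rng.normal(loc=0.01, scale=1.0, size=1_000_000), delta=0.1))
```

With a huge sample and a tiny true effect, this lands in the over-powered cell: the difference test rejects, but so does the equivalence test, so the detected difference is declared too small to matter.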

Answer to question 3: See answer to 2.
