Hypothesis Testing – How To Interpret Failure to Reject the Null Hypothesis in a Large Study


A basic limitation of null hypothesis significance testing is that it does not allow a researcher to gather evidence in favor of the null (Source)

I see this claim repeated in multiple places, but I can't find justification for it. If we perform a large study and we don't find statistically significant evidence against the null hypothesis, isn't that evidence for the null hypothesis?

Best Answer

Failing to reject a null hypothesis is evidence that the null hypothesis is true, but it might not be particularly good evidence, and it certainly doesn't prove the null hypothesis.

Let's take a short detour. Consider for a moment the old cliché:

Absence of evidence is not evidence of absence.

Notwithstanding its popularity, this statement is nonsense. If you look for something and fail to find it, that is absolutely evidence that it isn't there. How good that evidence is depends on how thorough your search was. A cursory search provides weak evidence; an exhaustive search provides strong evidence.

Now, back to hypothesis testing. When you run a hypothesis test, you are looking for evidence that the null hypothesis is not true. If you don't find it, then that is certainly evidence that the null hypothesis is true, but how strong is that evidence? To answer that, you have to know how likely it is that evidence that would have made you reject the null hypothesis could have eluded your search. That is, what is the probability of a false negative on your test? This is the Type II error rate, $\beta$, and its complement, $1-\beta$, is the power of the test.

Now, the power of the test, and with it the false negative rate, usually depends on the size of the effect you are looking for. Large effects are easier to detect than small ones, so there is no single $\beta$ for an experiment, and therefore no definitive answer to the question of how strong the evidence for the null hypothesis is. Put another way, there is always some effect size small enough that the experiment cannot rule it out.
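To make that dependence concrete, here is a minimal sketch (not part of the original answer) that computes the power, and hence the false negative rate $\beta$, of a two-sided two-sample z-test with known unit variance. The sample size of 500 per group and the effect sizes are hypothetical, chosen only to show how even a fairly large study leaves small effects essentially undetectable.

```python
import numpy as np
from scipy.stats import norm

def two_sample_z_power(effect_size, n_per_group, alpha=0.05):
    """Power of a two-sided, two-sample z-test with unit variance per group.

    effect_size is the true difference in means, in standard deviation units.
    """
    z_crit = norm.ppf(1 - alpha / 2)                 # two-sided critical value
    delta = effect_size * np.sqrt(n_per_group / 2)   # mean of the test statistic under the alternative
    # Probability the test statistic falls in either rejection region
    return norm.cdf(delta - z_crit) + norm.cdf(-delta - z_crit)

n = 500  # hypothetical "large study": 500 subjects per group
for d in [0.5, 0.2, 0.1, 0.05, 0.01]:
    power = two_sample_z_power(d, n)
    print(f"effect size d = {d:>4}: power = {power:.3f}, "
          f"false negative rate beta = {1 - power:.3f}")
```

At d = 0.5 the power is essentially 1, so failing to reject is strong evidence against an effect that large; at d = 0.01 the power barely exceeds the significance level, so the same non-significant result says almost nothing about an effect that small.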

From here, there are two ways to proceed. Sometimes you know you don't care about effects smaller than some threshold. In that case, you should probably reframe your experiment so that the null hypothesis is that the effect is at or above that threshold, and then test the alternative hypothesis that the effect is below it (see the sketch below). Alternatively, you could use your results to set bounds on the believable size of the effect: your conclusion would be that the effect lies in some interval, with some probability. That approach is just a small step away from a Bayesian treatment, which you might want to learn more about if you frequently find yourself in this sort of situation.
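One standard way to carry out that reframing is the two one-sided tests (TOST) procedure for equivalence. Below is a minimal sketch, assuming scipy is available; the simulated groups, the sample size, and the 0.2 standard deviation equivalence margin are all hypothetical choices for illustration.

```python
import numpy as np
from scipy import stats

def tost_ind(x1, x2, margin, equal_var=True):
    """Two one-sided tests (TOST) for equivalence of two group means.

    Null hypothesis: |mean(x1) - mean(x2)| >= margin.
    Alternative:     the difference lies inside (-margin, +margin).
    Shifting x1 by +/- margin turns each one-sided test into an
    ordinary two-sample t-test against a zero difference.
    """
    # H0: diff <= -margin  vs  H1: diff > -margin
    p_lower = stats.ttest_ind(x1 + margin, x2, equal_var=equal_var,
                              alternative='greater').pvalue
    # H0: diff >= +margin  vs  H1: diff < +margin
    p_upper = stats.ttest_ind(x1 - margin, x2, equal_var=equal_var,
                              alternative='less').pvalue
    # Equivalence is concluded only if both one-sided tests reject
    return max(p_lower, p_upper)

rng = np.random.default_rng(0)
x1 = rng.normal(0.00, 1.0, size=500)   # hypothetical treatment group
x2 = rng.normal(0.02, 1.0, size=500)   # hypothetical control group, tiny true difference
p = tost_ind(x1, x2, margin=0.2)       # 0.2 SD: smallest effect we care about
print(f"TOST p-value = {p:.4f}")       # small p => evidence the effect is within +/- 0.2 SD
```

A small TOST p-value lets you positively conclude that any effect is smaller than the margin, which is the kind of evidence for (approximate) absence that a plain non-significant result cannot provide on its own.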

There's a nice answer to a related question that touches on evidence of absence testing, which you might find useful.
