Hypothesis Testing – Convincing Examples of Useful p-Values

bayesian, frequentist, hypothesis-testing, inference, p-value

My question in the title is self-explanatory, but I would like to give it some context.

The ASA released a statement earlier this week, “on p-values: context, process, and purpose”, outlining various common misconceptions about the p-value and urging that it not be used without context and thought (which could be said of just about any statistical method, really).

In response to the ASA, Professor Matloff wrote a blog post titled After 150 Years, the ASA Says No to p-values. Then Professor Benjamini (and I) wrote a response post titled It’s not the p-values’ fault – reflections on the recent ASA statement. In response, Professor Matloff asked in a follow-up post:

What I would like to see [… is] — a good, convincing example
in which p-values are useful.
That really has to be the bottom line.

To quote his two major arguments against the usefulness of the $p$-value:

  1. With large samples, significance tests pounce on tiny, unimportant departures from the null hypothesis.

  2. Almost no null hypotheses are true in the real world, so performing a significance test on them is absurd and bizarre.

I am very interested in what other Cross Validated community members think of this question and these arguments, and in what might constitute a good response to them.

Best Answer

I will consider both of Matloff's points:

  1. With large samples, significance tests pounce on tiny, unimportant departures from the null hypothesis.

    The logic here is that if somebody reports a highly significant $p=0.0001$, then from this number alone we cannot say whether the effect is large and important or irrelevantly tiny (as can happen with large $n$). I find this argument strange and cannot connect to it at all, because I have never seen a study that would report a $p$-value without reporting [some equivalent of] effect size. Studies that I read would, for example, say (and usually show on a figure) that group A had such and such mean, group B had such and such mean, and they were significantly different with such and such $p$-value. I can obviously judge for myself whether the difference between A and B is large or small. (A short simulation sketch after this list makes this concrete.)

    (In the comments, @RobinEkman pointed me to several highly cited studies by Ziliak & McCloskey (1996, 2004), who observed that the majority of economics papers trumpet the "statistical significance" of some effects without paying much attention to the effect size and its "practical significance" (which, Z&M argue, can often be minuscule). This is clearly bad practice. However, as @MatteoS explained below, the effect sizes (regression estimates) are always reported, so my argument stands.)

  2. Almost no null hypotheses are true in the real world, so performing a significance test on them is absurd and bizarre.

    This concern is also often voiced, but here again I cannot really connect to it. It is important to realize that researchers do not increase their $n$ ad infinitum. In the branch of neuroscience that I am familiar with, people will do experiments with, say, $n=20$ or maybe $n=50$ rats. If there is no effect to be seen, then the conclusion is that the effect is not large enough to be interesting. Nobody I know would go on breeding, training, recording, and sacrificing $n=5000$ rats to show that there is some statistically significant but tiny effect. And whereas it might be true that almost no real effects are exactly zero, it is certainly true that many, many real effects are too small to be detected with the reasonable sample sizes that reasonable researchers are actually using, exercising their good judgment.

    (There is a valid concern that sample sizes are often not big enough and that many studies are underpowered. So perhaps researchers in many fields should aim for, say, $n=100$ instead of $n=20$. Still, whatever the sample size is, it puts a limit on the effect sizes that the study has power to detect; the sketch after this list illustrates this as well.)

    In addition, I do not think I agree that almost no null hypotheses are true, at least not in experimental randomized studies (as opposed to observational ones), for two reasons:

    • Very often there is a directionality to the prediction being tested; the researcher aims to demonstrate that some effect is positive, $\delta>0$. By convention this is usually done with a two-sided test assuming a point null $H_0: \delta=0$, but in fact it is rather a one-sided test trying to reject $H_0: \delta\le 0$. (@CliffAB's answer, +1, makes a related point.) And this null can certainly be true.

    • Even for the point "nil" null $H_0: \delta=0$, I do not see why it can never be true. Some things are just not causally related to other things. Look at the psychology studies that have failed to replicate in recent years: people feeling the future; women dressing in red when ovulating; priming with old-age-related words affecting walking speed; etc. It might very well be that there are no causal links here at all, and so the true effects are exactly zero.
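
To make both sample-size points above concrete, here is a minimal simulation sketch. It is purely illustrative (the helper `welch_summary`, the use of `scipy` and `statsmodels`, and all numbers are my own choices, not anything from Matloff's posts): with an enormous $n$, a practically negligible difference yields a tiny $p$-value, which is exactly why the effect size has to be read alongside it; and with the sample sizes researchers actually use, only fairly large effects are detectable at all.

```python
# Illustrative sketch (not from the original posts). Two demonstrations:
#   (1) with a huge n, a negligible true difference gives a tiny p-value,
#       so the effect size (not p alone) tells you whether it matters;
#   (2) with realistic n, only fairly large effects are detectable at all.
import numpy as np
from scipy import stats
from statsmodels.stats.power import TTestIndPower

rng = np.random.default_rng(0)

def welch_summary(n, true_diff, sd=1.0):
    """Simulate two groups differing by `true_diff` (in units of sd);
    return the observed mean difference and the Welch t-test p-value."""
    a = rng.normal(0.0, sd, size=n)
    b = rng.normal(true_diff, sd, size=n)
    _, p = stats.ttest_ind(a, b, equal_var=False)
    return b.mean() - a.mean(), p

# (1) Tiny true effect (0.01 sd): "highly significant" at n = 1e6, invisible at n = 20.
for n in (1_000_000, 20):
    diff, p = welch_summary(n=n, true_diff=0.01)
    print(f"n = {n:>7}: observed diff = {diff:+.4f}, p = {p:.2g}")

# (2) Minimum detectable effect (Cohen's d) at 80% power, alpha = 0.05.
for n in (20, 100, 5000):
    d = TTestIndPower().solve_power(effect_size=None, nobs1=n, alpha=0.05, power=0.8)
    print(f"n = {n:>4} per group: minimum detectable d = {d:.2f}")
```

With $n=20$ per group, roughly $d\approx 0.9$ is the smallest effect detectable at $80\%$ power, so reasonable sample sizes effectively screen out trivially small effects.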

Norm Matloff himself suggests using confidence intervals instead of $p$-values because they show the effect size. Confidence intervals are good, but notice one disadvantage of a confidence interval compared to the $p$-value: a confidence interval is reported for one particular coverage level, e.g. $95\%$. Seeing a $95\%$ confidence interval does not tell me how broad a $99\%$ confidence interval would be. But a single $p$-value can be compared against any $\alpha$, and different readers can have different alphas in mind.

In other words, I think that for somebody who likes to use confidence intervals, a $p$-value is a useful and meaningful additional statistic to report.
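
To make the coverage-level point concrete, here is a minimal sketch (the data and numbers are purely illustrative, of my own choosing). It computes a single $p$-value and then confidence intervals at several coverage levels, using the standard duality that a $(1-\alpha)$ confidence interval excludes zero exactly when the two-sided $p<\alpha$.

```python
# Illustrative sketch: one p-value serves every reader's alpha, whereas each
# confidence interval is tied to a single coverage level.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = rng.normal(0.25, 1.0, size=100)       # one sample with a modest true mean

_, p = stats.ttest_1samp(x, popmean=0.0)  # a single p-value ...
print(f"p = {p:.4f}")                     # ... compare it with whatever alpha you like

se = stats.sem(x)
for level in (0.90, 0.95, 0.99):          # each interval answers only one alpha
    lo, hi = stats.t.interval(level, df=len(x) - 1, loc=x.mean(), scale=se)
    excludes_zero = lo > 0 or hi < 0
    print(f"{level:.0%} CI = [{lo:+.3f}, {hi:+.3f}]; excludes 0: {excludes_zero} "
          f"(equivalent to p < {1 - level:.2f}: {p < 1 - level})")
```

Depending on the data, the $99\%$ interval may well include zero while the $95\%$ one does not; the single $p$-value lets each reader apply whatever $\alpha$ they have in mind.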


I would like to give a long quote about the practical usefulness of $p$-values from my favorite blogger Scott Alexander; he is not a statistician (he is a psychiatrist) but has lots of experience reading the psychological/medical literature and scrutinizing the statistics therein. The quote is from his blog post on the fake chocolate study, which I highly recommend. Emphasis mine.

[...] But suppose we're not allowed to do $p$-values. All I do is tell you "Yeah, there was a study with fifteen people that found chocolate helped with insulin resistance" and you laugh in my face. Effect size is supposed to help with that. But suppose I tell you "There was a study with fifteen people that found chocolate helped with insulin resistance. The effect size was $0.6$." I don't have any intuition at all for whether or not that's consistent with random noise. Do you? Okay, then they say we’re supposed to report confidence intervals. The effect size was $0.6$, with $95\%$ confidence interval of $[0.2, 1.0]$. Okay. So I check the lower bound of the confidence interval, I see it’s different from zero. But now I’m not transcending the $p$-value. I’m just using the $p$-value by doing a sort of kludgy calculation of it myself – “$95\%$ confidence interval does not include zero” is the same as “$p$-value is less than $0.05$”.

(Imagine that, although I know the $95\%$ confidence interval doesn’t include zero, I start wondering if the $99\%$ confidence interval does. If only there were some statistic that would give me this information!)

But wouldn’t getting rid of $p$-values prevent “$p$-hacking”? Maybe, but it would just give way to “$d$-hacking”. You don’t think you could test for twenty different metabolic parameters and only report the one with the highest effect size? The only difference would be that $p$-hacking is completely transparent – if you do twenty tests and report a $p$ of $0.05$, I know you’re an idiot – but $d$-hacking would be inscrutable. If you do twenty tests and report that one of them got a $d = 0.6$, is that impressive? [...]

But wouldn’t switching from $p$-values to effect sizes prevent people from making a big deal about tiny effects that are nevertheless statistically significant? Yes, but sometimes we want to make a big deal about tiny effects that are nevertheless statistically significant! Suppose that Coca-Cola is testing a new product additive, and finds in large epidemiological studies that it causes one extra death per hundred thousand people per year. That’s an effect size of approximately zero, but it might still be statistically significant. And since about a billion people worldwide drink Coke each year, that’s ten thousand deaths. If Coke said “Nope, effect size too small, not worth thinking about”, they would kill almost two milli-Hitlers worth of people.
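
The "$d$-hacking" point in this quote can be illustrated with a small simulation of my own (the fifteen subjects and twenty parameters mirror the example in the quote, but the code and numbers are mine, not Scott Alexander's): even when every true effect is exactly zero, the largest of twenty effect sizes tends to look respectable, while the accompanying $p$-value, read together with the number of tests performed, exposes the fishing.

```python
# Illustrative sketch of "d-hacking": test twenty parameters on pure noise and
# report only the largest standardized effect. The winning Cohen's d looks
# respectable even though every null hypothesis is true; the p-value together
# with the number of tests (e.g. a Bonferroni threshold of 0.05/20) is what
# exposes the multiple testing.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n_subjects, n_tests = 15, 20             # 15 people, 20 metabolic parameters

def cohens_d(a, b):
    """Standardized mean difference using a pooled standard deviation."""
    pooled_sd = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
    return (a.mean() - b.mean()) / pooled_sd

best_d, best_p = 0.0, 1.0
for _ in range(n_tests):                 # every true effect is exactly zero
    treatment = rng.normal(size=n_subjects)
    control = rng.normal(size=n_subjects)
    d = abs(cohens_d(treatment, control))
    _, p = stats.ttest_ind(treatment, control)
    if d > best_d:
        best_d, best_p = d, p

print(f"largest |d| out of {n_tests} null tests: {best_d:.2f} (p = {best_p:.3f})")
print(f"Bonferroni threshold for {n_tests} tests: {0.05 / n_tests:.4f}")
```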


For some further discussion of various alternatives to $p$-values (including Bayesian ones), see my answer in “ASA discusses limitations of $p$-values – what are the alternatives?”