My question in the title is self explanatory, but I would like to give it some context.
The ASA released a statement earlier this week, “on p-values: context, process, and purpose”, outlining various common misconceptions about the p-value and urging caution against using it without context and thought (which could be said of just about any statistical method, really).
In response to the ASA, Professor Matloff wrote a blog post titled After 150 Years, the ASA Says No to p-values. Then Professor Benjamini (and I) wrote a response post titled It’s not the p-values’ fault – reflections on the recent ASA statement. In response to it, Professor Matloff asked in a follow-up post:
What I would like to see [… is] — a good, convincing example
in which p-values are useful. That really has to be the bottom line.
To quote his two major arguments against the usefulness of the $p$-value:
- With large samples, significance tests pounce on tiny, unimportant departures from the null hypothesis.
- Almost no null hypotheses are true in the real world, so performing a significance test on them is absurd and bizarre.
I am very interested in what other crossvalidated community members think of this question/arguments, and of what may constitute a good response to it.
Best Answer
I will consider both of Matloff's points in turn:
The logic here is that if somebody reports a highly significant $p=0.0001$, then from this number alone we cannot say whether the effect is large and important or irrelevantly tiny (as can happen with large $n$). I find this argument strange and cannot connect to it at all, because I have never seen a study that reports a $p$-value without reporting [some equivalent of] the effect size. The studies I read would, e.g., say (and usually show on a figure) that group A had such and such mean, group B had such and such mean, and they were significantly different with such and such $p$-value. I can obviously judge for myself whether the difference between A and B is large or small.
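To make this point concrete, here is a minimal simulation (my own sketch, not from any of the posts discussed; the numbers are hypothetical): with $n$ of a million, a negligible true effect of $0.01$ standard deviations still produces an astronomically small $p$-value, so only the reported effect estimate tells the reader that the effect does not matter.

```python
# Sketch: with large n, a tiny effect is "highly significant".
import math
import random
from statistics import fmean

random.seed(42)
n = 1_000_000
true_effect = 0.01                    # 1% of a standard deviation: negligible
xs = [random.gauss(true_effect, 1.0) for _ in range(n)]

m = fmean(xs)                         # estimated effect size
sd = math.sqrt(fmean([(x - m) ** 2 for x in xs]))
se = sd / math.sqrt(n)                # standard error of the mean
z = m / se
p = math.erfc(abs(z) / math.sqrt(2))  # two-sided z-test of H0: true mean = 0

print(f"estimated effect = {m:.4f}, p = {p:.1e}")
```

The $p$-value alone is uninterpretable here; together with the effect estimate it is perfectly clear.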
(In the comments, @RobinEkman pointed me to several highly-cited studies by Ziliak & McCloskey (1996, 2004), who observed that the majority of economics papers trumpet the "statistical significance" of some effects without paying much attention to the effect size and its "practical significance" (which, Z&M argue, can often be minuscule). This is clearly bad practice. However, as @MatteoS explained below, the effect sizes (regression estimates) are always reported, so my argument stands.)
This concern is also often voiced, but here again I cannot really connect to it. It is important to realize that researchers do not increase their $n$ ad infinitum. In the branch of neuroscience that I am familiar with, people will do experiments with $n=20$ or maybe $n=50$ rats, say. If there is no effect to be seen, then the conclusion is that the effect is not large enough to be interesting. Nobody I know would go on breeding, training, recording, and sacrificing $n=5000$ rats to show that there is some statistically significant but tiny effect. And whereas it might be true that almost no real effects are exactly zero, it is certainly true that many, many real effects are too small to be detected with the reasonable sample sizes that reasonable researchers actually use, exercising their good judgment.
(There is a valid concern that sample sizes are often not big enough and that many studies are underpowered. So perhaps researchers in many fields should rather aim at, say, $n=100$ instead of $n=20$. Still, whatever the sample size is, it puts a limit on the effect size that the study has power to detect.)
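The point that the sample size caps the detectable effect can be sketched with the standard two-sample power formula (my own illustration; the numbers are textbook defaults, not from the posts): the smallest standardized effect $d$ detectable with $80\%$ power at $\alpha = 0.05$ shrinks only as $1/\sqrt{n}$.

```python
# Sketch: minimum detectable standardized effect for a two-sample test.
import math
from statistics import NormalDist

def min_detectable_d(n_per_group, alpha=0.05, power=0.80):
    nd = NormalDist()
    z_alpha = nd.inv_cdf(1 - alpha / 2)   # ~1.96 for alpha = 0.05
    z_beta = nd.inv_cdf(power)            # ~0.84 for 80% power
    return (z_alpha + z_beta) * math.sqrt(2 / n_per_group)

for n in (20, 50, 100):
    print(n, round(min_detectable_d(n), 2))
```

With $n=20$ per group only large effects ($d \approx 0.89$) are detectable; even $n=100$ only gets down to $d \approx 0.40$, so tiny departures from the null are simply out of reach at these sample sizes.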
In addition, I do not think I agree that almost no null hypotheses are true, at least not in the experimental randomized studies (as opposed to observational ones). Two reasons:
Very often there is a directionality to the prediction being tested; the researcher aims to demonstrate that some effect is positive, $\delta>0$. By convention this is usually done with a two-sided test assuming a point null $H_0: \delta=0$, but in fact it is rather a one-sided test trying to reject $H_0: \delta<0$. (@CliffAB's answer, +1, makes a related point.) And this null can certainly be true.
Even talking about the point "nil" null $H_0: \delta=0$, I do not see why it can never be true. Some things are just not causally related to other things. Look at the psychology studies that have failed to replicate in recent years: people feeling the future; women dressing in red when ovulating; priming with old-age-related words affecting walking speed; etc. It might very well be that there are no causal links here at all, and so the true effects are exactly zero.
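The directionality point above can be checked numerically (a small sketch of my own; the observed $z$-statistic is hypothetical): when the estimate lands in the predicted direction, the one-sided $p$-value against $H_0: \delta \le 0$ is exactly half the conventional two-sided $p$-value against the point null.

```python
# Sketch: one-sided vs. two-sided p-value for a z-statistic in the
# predicted (positive) direction.
from statistics import NormalDist

z_obs = 2.5                                  # hypothetical observed z-statistic
nd = NormalDist()
p_two_sided = 2 * (1 - nd.cdf(abs(z_obs)))   # point null H0: delta = 0
p_one_sided = 1 - nd.cdf(z_obs)              # directional null H0: delta <= 0
print(p_two_sided, p_one_sided)              # one-sided is half the two-sided
```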
Norm Matloff himself suggests using confidence intervals instead of $p$-values because they show the effect size. Confidence intervals are good, but notice one disadvantage of a confidence interval as compared to the $p$-value: a confidence interval is reported for one particular coverage value, e.g. $95\%$. Seeing a $95\%$ confidence interval does not tell me how broad a $99\%$ confidence interval would be. But one single $p$-value can be compared with any $\alpha$, and different readers can have different alphas in mind.
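A small numeric sketch of this asymmetry (my own example; the estimate and standard error are made up): one two-sided $p$-value can be compared against any reader's $\alpha$, whereas the reported $95\%$ interval does not by itself give the $99\%$ one.

```python
# Sketch: one p-value serves every alpha; each CI serves one coverage level.
from statistics import NormalDist

estimate, se = 0.30, 0.12                    # hypothetical effect and std. error
nd = NormalDist()
z = estimate / se
p = 2 * (1 - nd.cdf(abs(z)))                 # one number, usable with any alpha

for conf in (0.95, 0.99):
    half = nd.inv_cdf(1 - (1 - conf) / 2) * se
    print(f"{conf:.0%} CI: [{estimate - half:.3f}, {estimate + half:.3f}]")
print(f"p = {p:.4f}")
```

Here a reader with $\alpha=0.05$ rejects the null while a reader with $\alpha=0.01$ does not, and both can see that directly from the single reported $p$-value.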
In other words, I think that for somebody who likes to use confidence intervals, a $p$-value is a useful and meaningful additional statistic to report.
I would like to give a long quote about the practical usefulness of $p$-values from my favorite blogger Scott Alexander; he is not a statistician (he is a psychiatrist) but has lots of experience with reading the psychological/medical literature and scrutinizing the statistics therein. The quote is from his blog post on the fake chocolate study, which I highly recommend. Emphasis mine.
For some further discussion of various alternatives to $p$-values (including Bayesian ones), see my answer in ASA discusses limitations of $p$-values - what are the alternatives?