P-Value – Are Smaller P-Values More Convincing

confidence-interval, effect-size, hypothesis-testing, p-value, statistical-significance

I've been reading up on $p$-values, type 1 error rates, significance levels, power calculations, effect sizes and the Fisher vs Neyman-Pearson debate. This has left me feeling a bit overwhelmed. I apologise for the wall of text, but I felt it was necessary to provide an overview of my current understanding of these concepts, before I moved on to my actual questions.


From what I've gathered, a $p$-value is simply a measure of surprise: the probability of obtaining a result at least as extreme as the one observed, given that the null hypothesis is true. Fisher originally intended it to be a continuous measure.
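
For concreteness, here is a minimal simulation sketch of that definition (the data and one-sample test are hypothetical, chosen only for illustration): the $p$-value is estimated as the fraction of test statistics, generated with the null hypothesis true, that are at least as extreme as the observed one.

```python
import numpy as np

rng = np.random.default_rng(0)

observed = rng.normal(loc=0.3, scale=1.0, size=30)   # hypothetical data
t_obs = observed.mean() / (observed.std(ddof=1) / np.sqrt(len(observed)))

# Null world: the true mean is exactly 0; regenerate the statistic many times.
null_stats = np.array([
    s.mean() / (s.std(ddof=1) / np.sqrt(len(s)))
    for s in (rng.normal(0.0, 1.0, size=30) for _ in range(100_000))
])

# Two-sided p-value: how often is a null statistic at least as extreme as ours?
p_value = np.mean(np.abs(null_stats) >= abs(t_obs))
print(f"simulated p-value: {p_value:.4f}")
```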

In the Neyman-Pearson framework, you select a significance level in advance and use it as an (arbitrary) cut-off point. The significance level is equal to the type 1 error rate and is defined as a long-run frequency: if you were to repeat an experiment 1000 times at a significance level of 0.05 and the null hypothesis were true, about 50 of those experiments would show a significant effect purely due to sampling variability. By choosing a significance level, we guard ourselves against these false positives with a known probability. $P$-values traditionally do not appear in this framework.
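
A quick way to see this long-run frequency interpretation is to simulate it. The sketch below is my own illustration, assuming two-sample $t$-tests with $\alpha = 0.05$: it repeats a null experiment 1000 times and counts how many come out "significant"; the count is typically close to $\alpha \times 1000 = 50$.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
alpha = 0.05
n_experiments = 1000

false_positives = 0
for _ in range(n_experiments):
    a = rng.normal(0.0, 1.0, size=20)   # both groups come from the same
    b = rng.normal(0.0, 1.0, size=20)   # distribution, so the null is true
    _, p = stats.ttest_ind(a, b)
    if p < alpha:
        false_positives += 1

print(f"{false_positives} of {n_experiments} null experiments were 'significant'")
# Expected to be roughly alpha * n_experiments, i.e. about 50.
```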

If we find a $p$-value of 0.01, this does not mean that the type 1 error rate is 0.01; the type 1 error rate is stated a priori. I believe this is one of the major arguments in the Fisher vs N-P debate, because $p$-values are often reported as 0.05*, 0.01**, 0.001***. This can mislead people into saying that an effect is significant at a certain $p$-value, rather than at a certain significance level.

I also realise that the $p$-value is a function of the sample size, so it cannot be used as an absolute measure. A small $p$-value could point to a small, irrelevant effect in a large-sample experiment. To counter this, it is important to perform a power/effect-size calculation when determining the sample size for your experiment. $P$-values tell us whether there is an effect, not how large it is (see Sullivan 2012).
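
To illustrate the sample-size dependence, here is a hedged sketch (the numbers and effect size are my own assumptions): the same negligible true effect gives a non-significant $p$-value at small $n$ and a tiny $p$-value at very large $n$, while the estimated effect size stays small throughout.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
true_effect = 0.05   # an assumed, practically negligible mean shift

for n in (50, 500, 50_000):
    treatment = rng.normal(true_effect, 1.0, size=n)
    control = rng.normal(0.0, 1.0, size=n)
    _, p = stats.ttest_ind(treatment, control)
    # Cohen's d as a simple standardized effect-size estimate
    pooled_sd = np.sqrt((treatment.var(ddof=1) + control.var(ddof=1)) / 2)
    d = (treatment.mean() - control.mean()) / pooled_sd
    print(f"n = {n:6d}   p = {p:.4g}   Cohen's d = {d:+.3f}")
```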

My question:
How can I reconcile the fact that the $p$-value is a measure of surprise (smaller = more convincing) with the fact that it cannot be viewed as an absolute measurement?

What I am confused about is the following: can we be more confident in a small $p$-value than in a large one? In the Fisherian sense, I would say yes: we are more surprised. In the N-P framework, choosing a smaller significance level would imply we are guarding ourselves more strongly against false positives.

But on the other hand, $p$-values depend on sample size; they are not an absolute measure. Thus we cannot simply say that 0.001593 is more significant than 0.0439. Yet this is what Fisher's framework would imply: we would be more surprised by such an extreme value. There is even discussion about the term "highly significant" being a misnomer: Is it wrong to refer to results as being "highly significant"?

I've heard that $p$-values in some fields of science are only considered important when they are smaller than 0.0001, whereas in other fields values around 0.01 are already considered highly significant.

Best Answer

Are smaller $p$-values "more convincing"? Yes, of course they are.

In the Fisher framework, the $p$-value is a quantification of the amount of evidence against the null hypothesis. The evidence can be more or less convincing; the smaller the $p$-value, the more convincing it is. Note that in any given experiment with fixed sample size $n$, the $p$-value is monotonically related to the effect size, as @Scortchi nicely points out in his answer (+1). So smaller $p$-values correspond to larger effect sizes; of course they are more convincing!
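
To make the monotonic relationship concrete, here is a small sketch (my own, assuming a one-sample $z$-test with known standard deviation 1 and fixed $n$): as the observed effect grows, the two-sided $p$-value shrinks strictly.

```python
import numpy as np
from scipy import stats

n = 30
observed_means = np.array([0.1, 0.2, 0.3, 0.5, 0.8])   # hypothetical observed effects

z = observed_means * np.sqrt(n)        # z = mean / (sd / sqrt(n)) with sd = 1
p = 2 * stats.norm.sf(np.abs(z))       # two-sided p-values

for m, pv in zip(observed_means, p):
    print(f"observed mean {m:.1f}  ->  p = {pv:.5f}")
# The p-values decrease strictly as the observed effect increases.
```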

In the Neyman-Pearson framework, the goal is to obtain a binary decision: either the evidence is "significant" or it is not. By choosing the threshold $\alpha$, we guarantee that the long-run rate of false positives will not exceed $\alpha$. Note that different people can have different $\alpha$ in mind when looking at the same data; when I read a paper from a field that I am skeptical about, I might not personally consider results with e.g. $p=0.03$ "significant", even though the authors do call them significant. My personal $\alpha$ might be set to $0.001$ or so. Obviously, the lower the reported $p$-value, the more skeptical readers it will be able to convince! Hence, again, lower $p$-values are more convincing.

The currently standard practice is to combine the Fisher and Neyman-Pearson approaches: if $p<\alpha$, then the results are called "significant" and the $p$-value is [exactly or approximately] reported and used as a measure of convincingness (by marking it with stars, using expressions such as "highly significant", etc.); if $p>\alpha$, then the results are called "not significant" and that's it.
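
As a toy illustration of this hybrid reporting rule (the thresholds and star conventions below are common practice, not something prescribed in this answer), one might write:

```python
def report(p: float, alpha: float = 0.05) -> str:
    """Binary significant / not-significant call at a pre-chosen alpha,
    plus the exact p-value with conventional star markers."""
    if p >= alpha:
        return f"not significant (p = {p:.3g})"
    stars = "***" if p < 0.001 else "**" if p < 0.01 else "*"
    return f"significant (p = {p:.3g} {stars})"

for p in (0.2, 0.04, 0.008, 0.0004):
    print(report(p))
```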

This is usually referred to as a "hybrid approach", and indeed it is hybrid. Some people argue that this hybrid is incoherent; I tend to disagree. Why would it be invalid to do two valid things at the same time?
