P-Value – Are Smaller P-Values More Convincing

confidence-interval, effect-size, hypothesis-testing, p-value, statistical-significance

I've been reading up on $p$-values, type 1 error rates, significance levels, power calculations, effect sizes and the Fisher vs Neyman-Pearson debate. This has left me feeling a bit overwhelmed. I apologise for the wall of text, but I felt it was necessary to provide an overview of my current understanding of these concepts, before I moved on to my actual questions.


From what I've gathered, a $p$-value is simply a measure of surprise: the probability of obtaining a result at least as extreme as the one observed, given that the null hypothesis is true. Fisher originally intended it to be a continuous measure.
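
For concreteness, here is a minimal simulation sketch of that definition (the data and one-sample test are hypothetical, chosen only for illustration): the $p$-value is estimated as the fraction of test statistics, generated with the null hypothesis true, that are at least as extreme as the observed one.

```python
import numpy as np

rng = np.random.default_rng(0)

observed = rng.normal(loc=0.3, scale=1.0, size=30)   # hypothetical data
t_obs = observed.mean() / (observed.std(ddof=1) / np.sqrt(len(observed)))

# Null world: the true mean is exactly 0; regenerate the statistic many times.
null_stats = np.array([
    s.mean() / (s.std(ddof=1) / np.sqrt(len(s)))
    for s in (rng.normal(0.0, 1.0, size=30) for _ in range(100_000))
])

# Two-sided p-value: how often is a null statistic at least as extreme as ours?
p_value = np.mean(np.abs(null_stats) >= abs(t_obs))
print(f"simulated p-value: {p_value:.4f}")
```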

In the Neyman-Pearson framework, you select a significance level in advance and use it as an (arbitrary) cut-off point. The significance level is equal to the type 1 error rate and is defined as a long-run frequency: if you were to repeat an experiment 1000 times at a significance level of 0.05 and the null hypothesis were true, about 50 of those experiments would show a significant effect purely due to sampling variability. By choosing a significance level, we guard ourselves against these false positives with a known probability. $P$-values traditionally do not appear in this framework.
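
A quick way to see this long-run frequency interpretation is to simulate it. The sketch below is my own illustration, assuming two-sample $t$-tests with $\alpha = 0.05$: it repeats a null experiment 1000 times and counts how many come out "significant"; the count is typically close to $\alpha \times 1000 = 50$.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
alpha = 0.05
n_experiments = 1000

false_positives = 0
for _ in range(n_experiments):
    a = rng.normal(0.0, 1.0, size=20)   # both groups come from the same
    b = rng.normal(0.0, 1.0, size=20)   # distribution, so the null is true
    _, p = stats.ttest_ind(a, b)
    if p < alpha:
        false_positives += 1

print(f"{false_positives} of {n_experiments} null experiments were 'significant'")
# Expected to be roughly alpha * n_experiments, i.e. about 50.
```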

If we find a $p$-value of 0.01, this does not mean that the type 1 error rate is 0.01; the type 1 error rate is stated a priori. I believe this is one of the major arguments in the Fisher vs N-P debate, because $p$-values are often reported as 0.05*, 0.01**, 0.001***. This can mislead people into saying that an effect is significant at a certain $p$-value, rather than at a certain significance level.

I also realise that the $p$-value is a function of the sample size, so it cannot be used as an absolute measure. A small $p$-value could point to a small, irrelevant effect in a large-sample experiment. To counter this, it is important to perform a power/effect-size calculation when determining the sample size for your experiment. $P$-values tell us whether there is an effect, not how large it is (see Sullivan 2012).
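
To illustrate the sample-size dependence, here is a hedged sketch (the numbers and effect size are my own assumptions): the same negligible true effect gives a non-significant $p$-value at small $n$ and a tiny $p$-value at very large $n$, while the estimated effect size stays small throughout.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
true_effect = 0.05   # an assumed, practically negligible mean shift

for n in (50, 500, 50_000):
    treatment = rng.normal(true_effect, 1.0, size=n)
    control = rng.normal(0.0, 1.0, size=n)
    _, p = stats.ttest_ind(treatment, control)
    # Cohen's d as a simple standardized effect-size estimate
    pooled_sd = np.sqrt((treatment.var(ddof=1) + control.var(ddof=1)) / 2)
    d = (treatment.mean() - control.mean()) / pooled_sd
    print(f"n = {n:6d}   p = {p:.4g}   Cohen's d = {d:+.3f}")
```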

My question:
How can I reconcile the fact that the $p$-value is a measure of surprise (smaller = more convincing) with the fact that it cannot be viewed as an absolute measurement?

What I am confused about is the following: can we be more confident in a small $p$-value than in a large one? In the Fisherian sense, I would say yes: we are more surprised. In the N-P framework, choosing a smaller significance level would imply we are guarding ourselves more strongly against false positives.

But on the other hand, $p$-values depend on sample size; they are not an absolute measure. Thus we cannot simply say that 0.001593 is more significant than 0.0439. Yet this is what Fisher's framework would imply: we would be more surprised by such an extreme value. There is even discussion about the term "highly significant" being a misnomer: Is it wrong to refer to results as being "highly significant"?

I've heard that $p$-values in some fields of science are only considered important when they are smaller than 0.0001, whereas in other fields values around 0.01 are already considered highly significant.

Best Answer

Are smaller $p$-values "more convincing"? Yes, of course they are.

In the Fisher framework, the $p$-value is a quantification of the amount of evidence against the null hypothesis. The evidence can be more or less convincing; the smaller the $p$-value, the more convincing it is. Note that in any given experiment with fixed sample size $n$, the $p$-value is monotonically related to the effect size, as @Scortchi nicely points out in his answer (+1). So smaller $p$-values correspond to larger effect sizes; of course they are more convincing!
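
To make the monotonic relationship concrete, here is a small sketch (my own, assuming a one-sample $z$-test with known standard deviation 1 and fixed $n$): as the observed effect grows, the two-sided $p$-value shrinks strictly.

```python
import numpy as np
from scipy import stats

n = 30
observed_means = np.array([0.1, 0.2, 0.3, 0.5, 0.8])   # hypothetical observed effects

z = observed_means * np.sqrt(n)        # z = mean / (sd / sqrt(n)) with sd = 1
p = 2 * stats.norm.sf(np.abs(z))       # two-sided p-values

for m, pv in zip(observed_means, p):
    print(f"observed mean {m:.1f}  ->  p = {pv:.5f}")
# The p-values decrease strictly as the observed effect increases.
```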

In the Neyman-Pearson framework, the goal is to obtain a binary decision: either the evidence is "significant" or it is not. By choosing the threshold $\alpha$, we guarantee that the long-run rate of false positives will not exceed $\alpha$. Note that different people can have different $\alpha$ in mind when looking at the same data; when I read a paper from a field that I am skeptical about, I might not personally consider results with e.g. $p=0.03$ "significant", even though the authors do call them significant. My personal $\alpha$ might be set to $0.001$ or so. Obviously, the lower the reported $p$-value, the more skeptical readers it will be able to convince! Hence, again, lower $p$-values are more convincing.

The currently standard practice is to combine the Fisher and Neyman-Pearson approaches: if $p<\alpha$, then the results are called "significant" and the $p$-value is [exactly or approximately] reported and used as a measure of convincingness (by marking it with stars, using expressions such as "highly significant", etc.); if $p>\alpha$, then the results are called "not significant" and that's it.
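
As a toy illustration of this hybrid reporting rule (the thresholds and star conventions below are common practice, not something prescribed in this answer), one might write:

```python
def report(p: float, alpha: float = 0.05) -> str:
    """Binary significant / not-significant call at a pre-chosen alpha,
    plus the exact p-value with conventional star markers."""
    if p >= alpha:
        return f"not significant (p = {p:.3g})"
    stars = "***" if p < 0.001 else "**" if p < 0.01 else "*"
    return f"significant (p = {p:.3g} {stars})"

for p in (0.2, 0.04, 0.008, 0.0004):
    print(report(p))
```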

This is usually referred to as a "hybrid approach", and indeed it is hybrid. Some people argue that this hybrid is incoherent; I tend to disagree. Why would it be invalid to do two valid things at the same time?
