Hypothesis Testing – Using Multiple Comparisons Corrections in Fisher P-Value Framework

hypothesis-testing, multiple-comparisons, p-value

I am wary of stepping into the civil war between the Fisher and Neyman-Pearson interpretations of the $p$-value (which has been well elucidated here and here), but I've been pondering a question that I keep going in circles on: is it appropriate to apply a correction for multiple comparisons to a $p$-value in a Fisher paradigm of interpreting $p$-values?

Now, as I understand it, corrections for multiple comparisons are officially made to $\alpha$, not to $p$-values. For example, the Bonferroni correction replaces $\alpha$ with $\alpha/k$, where $k$ is the number of comparisons. But it's an easy switch to multiply the $p$-value by $k$ instead, and equivalent conclusions are drawn in a Neyman-Pearson framework (it's also easier to present to folks more comfortable seeing $\alpha=0.05$).
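This equivalence is easy to check. A minimal sketch (the $p$-values, `k`, and `alpha` below are made up for illustration): comparing raw $p$-values to $\alpha/k$ and comparing adjusted $p$-values $\min(kp, 1)$ to $\alpha$ give identical decisions:

```python
# Made-up example: the same Bonferroni decision reached two ways.
alpha = 0.05
k = 5  # number of comparisons
p_values = [0.004, 0.045, 0.30, 0.011, 0.62]

# Option 1: compare raw p-values to the adjusted threshold alpha/k.
reject_via_threshold = [p < alpha / k for p in p_values]

# Option 2: compare adjusted p-values min(k*p, 1) to the original alpha.
reject_via_adjusted_p = [min(k * p, 1.0) < alpha for p in p_values]

assert reject_via_threshold == reject_via_adjusted_p  # identical decisions
print(reject_via_threshold)  # only the first test is rejected
```

In practice, `statsmodels.stats.multitest.multipletests` with `method='bonferroni'` performs the same capped $kp$ adjustment.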

However, the Fisher framework doesn't have an $\alpha$ value. The $p$-value is treated more as evidence against the null hypothesis than as a hard decision criterion. Given that, is it still reasonable to correct this $p$-value for multiple comparisons?


For some additional context, I work in a drug-discovery-oriented environment, and the necessary conditions for using Neyman-Pearson (particularly the power requirements) aren't met. The Fisher framework seems much more appropriate for guiding the pursuit of promising treatments (in a discovery mode, not a confirmatory mode). For example, suppose there are $5$ potential treatments compared to a control, and the best option shows a clinically meaningful difference at $p=0.045$. Uncorrected, that seems promising and worth pursuing. If I apply a Bonferroni correction, however, $p=0.225$, and I would likely go back to the drawing board and throw out all $5$ treatments.

[Note: I don't have to correct using Bonferroni either; it's just the easiest to use as an example. I'm more interested in the theory of applying corrections for multiple comparisons and family-wise error rates.]

Best Answer

Following @MichaelLew's answer (+1), I changed my point of view to the opposite one; now I think that $p$-values should NOT be corrected. I have reworked my answer.

To make the discussion more lively, I will refer to the famous XKCD comic where $20$ colours of jelly beans are independently tested for a link to acne, and green jelly beans yield $p<0.05$; for concreteness, let us assume it was $p=0.02$:

[XKCD comic: green jelly beans linked to acne]

The Fisher approach is to consider the $p$-value as quantifying the strength of evidence, or rather as a measure of surprise ("surprisingness") -- I like this expression and find it intuitively clear and at the same time quite precise. We pretend that the null is true and quantify how surprised we should then be to observe such results. This yields a $p$-value. In the "hybrid" Fisher-Neyman-Pearson approach, if we are surprised more than some chosen surprisingness threshold ($p<\alpha$), then we additionally call the results "significant"; this allows us to control the type I error rate.

Importantly, the threshold should represent our prior beliefs and expectations. For example, "extraordinary claims require extraordinary evidence": we would need to be very surprised to believe the evidence of e.g. clairvoyance, and so would like to set a very low threshold.

In the jelly beans example, each individual $p$-value reflects the surprisingness of each individual correlation. The Bonferroni correction replaces $\alpha$ with $\alpha/k$ to control the overall type I error rate. In the first version of this answer, I argued that we should also be less surprised by getting $p=0.02$ for green jelly beans if we know that we ran $20$ tests (and should consider it weaker evidence), and hence that Fisher's $p$-values should also be replaced with $kp$.

Now I think that was wrong, and that $p$-values should not be adjusted.

First of all, note that for the hybrid approach to be coherent, we cannot adjust both the $p$-values and the $\alpha$ threshold; only one or the other can be adjusted. Here are two arguments for why it should be $\alpha$.

  1. Consider exactly the same jelly beans setting, but suppose we a priori expected green jelly beans to be likely linked to acne (say, somebody had suggested a theory with this prediction). Then we would be happy to see $p=0.02$ and would not adjust anything. But nothing about the experiment has changed! If the $p$-value is a measure of surprisingness (of each individual experiment), then $p=0.02$ should stay the same. What changes is our $\alpha$, and that is only natural because, as I argued above, the threshold in one way or another always reflects our assumptions and expectations.

  2. The $p$-value has a clear interpretation: it is the probability of obtaining the observed (or more extreme) results under the null hypothesis. If there is no link between green jelly beans and acne, then this probability is $p=0.02$. Replacing it with $kp=20\cdot 0.02=0.4$ ruins this interpretation; the result is no longer a probability of anything. Moreover, imagine that not $20$ colours were tested, but $100$. Then $kp=2$, which is larger than $1$ and obviously cannot be a probability, whereas dividing $\alpha$ by $100$ still makes sense.
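The arithmetic in point 2 can be spelled out in a throwaway sketch (same numbers as in the text):

```python
# k*p stops being a probability once k is large enough,
# while alpha/k remains a valid (if stringent) threshold for any k.
p, alpha = 0.02, 0.05
for k in (20, 100):
    print(k, k * p, alpha / k)

assert 100 * p > 1          # the "adjusted p-value" exceeds 1
assert 0 < alpha / 100 < 1  # the adjusted threshold is still a probability
```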

To put it in terms of evidence, the "evidence" that green jelly beans are linked to acne is measured as $p=0.02$, and that's that; what changes depending on the circumstances (in this case, on the number of tests performed) is how we treat this evidence.

I should stress that "how we treat the evidence" is something that is very much not fixed in Fisher's framework either (see this famous quote). When I say that $p$-values are better left unadjusted, it does not mean that Sir Ronald Fisher would look at $p=0.02$ for green jelly beans and consider it a convincing result. I am sure he would still be wary of it.

Concluding metaphor: the process of cherry-picking does not modify the cherries! It modifies how we treat these cherries.
