Bonferroni Correction Alternative – Methods for Multiple One-vs-Rest Association Tests

association-measure, bonferroni, contingency-tables, hypothesis-testing, permutation-test

I have anonymized the 'nouns' of this question to protect my employer. It is not really about lab rats and experimental treatments.

I am also coming more from a Machine Learning background, so my lingo might reflect that, although I make an honest effort to do the Statistics correctly and use the right terminology.

1,000 lab rats each receive one of 26 experimental treatments labeled A, B, …, Z.
The treatments are very unequally distributed – some were administered to only a few rats, and others to hundreds.

Rats whose blood tests showed significant improvement after two weeks were marked as "Positive Outcome"; otherwise they were marked as "Negative Outcome".

To determine which treatments have some kind of association with the outcome, I have constructed 26 separate 2×2 contingency tables, each comparing "This Treatment" (e.g., Treatment A) vs. "Other Treatments" (e.g., Treatments B–Z) against Outcome. I test each table for association at the 0.05 significance level.
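For concreteness, here is a minimal sketch of that setup in Python, using simulated data with hypothetical column names ("treatment", "positive") standing in for the real ones. Fisher's exact test is used here, but a chi-squared test on the same tables would work the same way:

```python
import numpy as np
import pandas as pd
from scipy.stats import fisher_exact

# Simulated stand-in for the real data: 1,000 subjects, 26 treatments,
# a binary outcome. All names and numbers here are made up.
rng = np.random.default_rng(0)
treatments = list("ABCDEFGHIJKLMNOPQRSTUVWXYZ")
weights = rng.dirichlet(np.full(26, 0.5))  # deliberately unequal group sizes
df = pd.DataFrame({
    "treatment": rng.choice(treatments, size=1000, p=weights),
    "positive": rng.random(1000) < 0.3,
})

# One 2x2 table per treatment: (this treatment vs. the rest) x (outcome).
p_values = {}
for t in treatments:
    this = df["treatment"] == t
    table = [
        [(this & df["positive"]).sum(), (this & ~df["positive"]).sum()],
        [(~this & df["positive"]).sum(), (~this & ~df["positive"]).sum()],
    ]
    _, p = fisher_exact(table)
    p_values[t] = p

print([t for t, p in p_values.items() if p < 0.05])  # uncorrected 0.05 level
```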

But wait! Aren't we supposed to use the Bonferroni Correction for multiple testing, i.e. a 0.05 / 26 ≈ 0.0019 significance level? Sure, but then nothing is statistically significant, and I know from domain expertise that this is not a practically useful or "accurate" conclusion.

And because the tests are not independent, I suspect that a less conservative correction could still control the overall false-positive rate across all of them.

The tests are not independent – a successful, very frequently administered treatment will appear in the "rest" group of the other 25 "one-vs-rest" hypotheses.

Looking into other approaches, I don't want to do something exotic like q-value testing (which controls the False Discovery Rate rather than the chance of any False Positive) because:

  1. It limits my ability to communicate the results because it is a less common approach
  2. A False Positive carries a far greater cost to the organization – that seems to be the thing to avoid.

So, I'd like to use a p-value, but I'd just like a correction that reflects reality a bit better than the Bonferroni correction. One that takes into account the lack of independence between the multiple comparisons, for example. Or just avoids the problem altogether.

Do you have recommendations? Permutation Testing seems like it might be a good choice.
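For example, here is the kind of scheme I have in mind – a max-statistic (Westfall–Young style) permutation test, which controls the FWER while automatically respecting the dependence between the 26 comparisons. It reuses the hypothetical `df` and `treatments` from the sketch above, and the per-treatment statistic (absolute deviation of the group's positive rate from the overall rate) is just one reasonable choice:

```python
import numpy as np

labels = df["treatment"].to_numpy()
outcome = df["positive"].to_numpy().astype(float)

def stats_by_treatment(labels, outcome):
    """Absolute deviation of each treatment's positive rate from the overall rate."""
    overall = outcome.mean()
    return np.array([
        abs(outcome[labels == t].mean() - overall) if (labels == t).any() else 0.0
        for t in treatments
    ])

rng = np.random.default_rng(1)
n_perm = 5000
obs = stats_by_treatment(labels, outcome)

# Under the null, outcomes are exchangeable across treatments, so shuffling
# the outcome vector breaks any treatment-outcome association. Recording the
# *maximum* statistic per shuffle is what gives family-wise control.
max_null = np.array([
    stats_by_treatment(labels, rng.permutation(outcome)).max()
    for _ in range(n_perm)
])

# FWER-adjusted p-value: how often the null maximum reaches each observed stat.
p_adj = {t: (max_null >= o).mean() for t, o in zip(treatments, obs)}
print({t: round(p, 3) for t, p in p_adj.items() if p < 0.05})
```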

Best Answer

First, there is no reason to use the original Bonferroni correction any more. As the Wikipedia page notes, the Holm modification of that method is uniformly more powerful while maintaining the same control over family-wise error rate. There are extensions and alternatives that might provide even better power.
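To make the step-down concrete, here is a small sketch of Holm's procedure applied to the hypothetical `p_values` dict from your first snippet; statsmodels provides the same thing ready-made via `multipletests(..., method="holm")`:

```python
# Holm step-down: sort p-values ascending and compare the i-th smallest to
# alpha / (m - i + 1); stop at the first failure. Every threshold is at
# least as large as Bonferroni's alpha / m, hence the uniform power gain.
alpha = 0.05
items = sorted(p_values.items(), key=lambda kv: kv[1])
m = len(items)
rejected = []
for i, (t, p) in enumerate(items):   # i is 0-based, so the divisor is m - i
    if p <= alpha / (m - i):
        rejected.append(t)
    else:
        break
print(rejected)
```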

Second, I personally find false discovery rate (FDR) easier to explain and more useful in practice with this type of study than family-wise error rate (FWER). An FDR of 5% essentially means that 5% of the nominally positive results are likely to be false positives. Even a businessman should be able to understand that. An FWER of 5% means that if I do the same experiment multiple times, then in only 5% of experiments will I find any false positives. How many people really understand the frequentist meaning of the p-values that underlie FWER, and how many people would really want to miss multiple true positive findings just because there might be a single false positive hiding somewhere in the results?
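If FDR control turns out to be acceptable to your organization, the Benjamini–Hochberg procedure is one line with statsmodels, again using the hypothetical `p_values` from your question:

```python
from statsmodels.stats.multitest import multipletests

names = list(p_values)
# method="fdr_bh" is Benjamini-Hochberg; swap in "holm" for FWER control.
reject, p_adjusted, _, _ = multipletests(
    [p_values[t] for t in names], alpha=0.05, method="fdr_bh"
)
print([t for t, r in zip(names, reject) if r])
```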

Third, with a binary outcome you should use a more efficient logistic regression model to handle your data. Your "treatments" would be coded as 26 levels of a single (unordered) factor variable. The logistic regression would determine whether there were any significant differences among the treatments with respect to outcome. If not, you stop. If there are, standard approaches like those used for analysis of variance provide principled ways to deal with multiple comparisons that can be more powerful than what you would get with Holm-Bonferroni.
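A sketch of that approach with statsmodels, using your hypothetical `df`. One caveat: treatments given to only a handful of rats can cause separation or convergence problems, in which case a penalized fit may be needed.

```python
import statsmodels.formula.api as smf

# Treatment as a single 26-level unordered factor; outcome as 0/1.
df["y"] = df["positive"].astype(int)
fit = smf.logit("y ~ C(treatment)", data=df).fit(disp=0)

# Omnibus likelihood-ratio test against the intercept-only model:
# "is there any difference among treatments at all?" Stop here if not.
print(fit.llr_pvalue)

# If the omnibus test is significant, follow up with specific contrasts,
# e.g. via fit.t_test(...), plus a multiple-comparison adjustment on those.
```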
