Solved – When should I use FDR or Bonferroni in multiple comparisons

anova, bonferroni, many-categories, multiple-comparisons

I have the following two situations:

(1) A very large number of variables, say 100 variables and 200 samples (observations), although each variable has only 2, 3, or at most 4 levels. I am using a linear ANOVA to test the hypothesis.

Say, for the first variable, $H_0$: l1 = l2 = l3 = l4, where l1, l2, l3, and l4 are the means at levels 1 to 4. The hypothesis is tested in the same way for all 100 variables independently (not a multiple regression, but rather taking one variable at a time).

(2) A single variable with a large number of levels (let's say 100 levels).

Now there is a single null hypothesis:

$H_0$: l1 = l2 = l3 = l4 = ... = l100

In both situations, do I need some sort of multiple comparison correction, such as the Bonferroni correction or the false discovery rate correction?

Why?

Best Answer

First of all, both situations are the same from a statistical and computational point of view. To perform the analysis, factors are encoded as a number of dummy variables (at least in R). So a factor with 100 levels corresponds to 99 binary variables.
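To make the "100 levels correspond to 99 binary variables" point concrete, here is a minimal sketch of treatment (dummy) coding in plain Python. The helper `dummy_code` and the factor values are hypothetical and only illustrate the idea; R's own coding is done internally and is not reproduced here.

```python
def dummy_code(values):
    """Treatment-code a categorical factor: one 0/1 column per non-reference level."""
    levels = sorted(set(values))       # the first sorted level acts as the reference
    non_reference = levels[1:]         # k levels -> k - 1 indicator columns
    rows = [[1 if v == level else 0 for level in non_reference] for v in values]
    return non_reference, rows

# Hypothetical factor with 100 levels, each observed twice (200 observations).
factor = [f"level_{i:03d}" for i in range(100)] * 2
columns, design = dummy_code(factor)

print(len(columns))    # number of dummy columns for a 100-level factor
print(sum(design[0]))  # an observation at the reference level has no indicator set
```

The reference level is absorbed into the intercept, which is why one column fewer than the number of levels is needed.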

The ANOVA F-test is an omnibus test. Applying a multi-factor ANOVA to the first situation, you are testing whether the means in all groups created by all variables are equal. Only if the result is significant and you want to dig deeper should you worry about the multiple comparison problem. However, this point has already been worked out in many ways. Instead of using multiple t-tests and correcting $\alpha$ with the FDR or Bonferroni method, it is better to use one of the dedicated post-hoc tests (which are designed to take multiple comparisons into account), like Tukey's HSD. This Wikipedia article contains a discussion of the alternatives.

To summarize: when performing a single multi-factor ANOVA, a multiple comparison correction is not required. When using a proper post-hoc test, a multiple comparison correction is not required either.
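As a rough illustration of what "omnibus" means, the sketch below computes the one-way ANOVA F statistic and returns a single p-value for all groups at once. Note the swap: to stay within the Python standard library, the p-value comes from a label-permutation test rather than from the F distribution that a classical ANOVA would use. The function names and the data are made up for the example.

```python
import random
from statistics import fmean

def f_statistic(groups):
    """One-way ANOVA F: between-group mean square over within-group mean square."""
    all_values = [x for g in groups for x in g]
    grand_mean = fmean(all_values)
    k, n = len(groups), len(all_values)
    ss_between = sum(len(g) * (fmean(g) - grand_mean) ** 2 for g in groups)
    ss_within = 0.0
    for g in groups:
        group_mean = fmean(g)
        ss_within += sum((x - group_mean) ** 2 for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

def permutation_anova(groups, n_perm=2000, seed=0):
    """Return (F, p); p is estimated by shuffling observations across groups."""
    rng = random.Random(seed)
    observed = f_statistic(groups)
    pooled = [x for g in groups for x in g]
    sizes = [len(g) for g in groups]
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        shuffled, start = [], 0
        for size in sizes:
            shuffled.append(pooled[start:start + size])
            start += size
        if f_statistic(shuffled) >= observed:
            hits += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    return observed, (hits + 1) / (n_perm + 1)

# Hypothetical measurements for a factor with three levels.
groups = [
    [4.1, 5.0, 4.7, 5.3, 4.4],
    [5.2, 4.8, 5.5, 4.9, 5.1],
    [7.9, 8.4, 8.1, 7.6, 8.8],
]
f_value, p_value = permutation_anova(groups)
print(f"F = {f_value:.1f}, permutation p = {p_value:.4f}")
```

However many levels the factor has, this step produces one test and one p-value; pairwise follow-up comparisons are where a post-hoc procedure such as Tukey's HSD comes in.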

Related Solutions

Multiple Comparisons – Why Aren’t Multiple Hypothesis Corrections Applied to All Experiments?

This would obviously be an absolute nightmare to do in practice, but suppose it could be done: we appoint a Statistical Sultan and everyone running a hypothesis test reports their raw $p$-values to this despot. He performs some kind of global (literally) multiple comparisons correction and replies with the corrected versions.

Would this usher in a golden age of science and reason? No, probably not.

Let's start by considering one pair of hypotheses, as in a $t$-test. We measure some property of two groups and want to distinguish between two hypotheses about that property: $$\begin{align} H_0:& \textrm{ The groups have the same mean.} \\ H_A:& \textrm{ The groups have different means.} \end{align}$$ In a finite sample, the means are unlikely to be exactly equal even if $H_0$ really is true: measurement error and other sources of variability can push individual values around. However, the $H_0$ hypothesis is in some sense "boring", and researchers are typically concerned with avoiding a "false positive" situation wherein they claim to have found a difference between the groups where none really exists. Therefore, we only call results "significant" if they seem unlikely under the null hypothesis, and, by convention, that unlikeliness threshold is set at 5%.

This applies to a single test. Now suppose you decide to run multiple tests and are willing to accept a 5% chance of mistakenly rejecting $H_0$ for each one. With enough tests, you are therefore almost certainly going to start making errors, and lots of them.
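A small simulation can make this tangible. The sketch below assumes every null hypothesis is true and that the tests are independent, so each p-value is uniform on [0, 1]; the function name and the number of simulated families are arbitrary choices for illustration.

```python
import random

def simulate_families(n_tests, n_families=5000, alpha=0.05, seed=1):
    """Simulate families of independent true-null tests.

    Returns (share of families with at least one false positive,
             mean number of false positives per family).
    """
    rng = random.Random(seed)
    families_with_error = 0
    total_errors = 0
    for _ in range(n_families):
        # Under a true null, a p-value falls below alpha with probability alpha.
        errors = sum(rng.random() < alpha for _ in range(n_tests))
        total_errors += errors
        families_with_error += errors > 0
    return families_with_error / n_families, total_errors / n_families

for m in (1, 20, 100):
    share, mean_errors = simulate_families(m)
    print(f"{m:>3} tests: P(at least one false positive) ~ {share:.2f}, "
          f"mean false positives ~ {mean_errors:.2f}")
```

With a single test the share stays near the nominal 5%, while with 100 uncorrected tests nearly every simulated family contains at least one false positive.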

The various multiple comparisons correction approaches are intended to help you get back to a nominal error rate that you have already chosen to tolerate for individual tests. They do so in slightly different ways. Methods that control the Family-Wise Error Rate, like the Bonferroni, Sidak, and Holm procedures, say "You wanted a 5% chance of making an error on a single test, so we'll ensure that there's no more than a 5% chance of making any errors across all of your tests." Methods that control the False Discovery Rate instead say "You are apparently okay with being wrong up to 5% of the time with a single test, so we'll ensure that no more than 5% of your 'calls' are wrong when doing multiple tests". (See the difference?)
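The difference between the two kinds of guarantee shows up when the adjustments are applied to the same p-values. Below is a minimal sketch of Bonferroni and Holm (family-wise error rate) and Benjamini-Hochberg (false discovery rate) adjusted p-values in plain Python. The input p-values are invented for illustration, and Benjamini-Hochberg is assumed here as the FDR procedure since the text does not name one.

```python
def bonferroni(p_values):
    """Multiply each p-value by the number of tests, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]

def holm(p_values):
    """Step-down Holm adjustment; controls the family-wise error rate."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # smallest p first
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        running_max = max(running_max, (m - rank) * p_values[i])
        adjusted[i] = min(1.0, running_max)
    return adjusted

def benjamini_hochberg(p_values):
    """Step-up Benjamini-Hochberg adjustment; controls the false discovery rate."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i], reverse=True)  # largest p first
    adjusted = [0.0] * m
    running_min = 1.0
    for step, i in enumerate(order):
        rank = m - step  # 1-based rank of this p-value in ascending order
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical raw p-values from five related tests.
p_values = [0.021, 0.3, 0.001, 0.045, 0.012]
for name, method in [("Bonferroni", bonferroni), ("Holm", holm),
                     ("Benjamini-Hochberg", benjamini_hochberg)]:
    adjusted = method(p_values)
    n_rejected = sum(p < 0.05 for p in adjusted)
    print(f"{name:<20} {[round(p, 4) for p in adjusted]}  rejected: {n_rejected}")
```

On these inputs the FDR procedure flags more tests than either FWER procedure, which reflects the weaker guarantee it gives: a bounded expected share of wrong calls rather than a bounded chance of any wrong call.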

Now, suppose you attempted to control the family-wise error rate of all hypothesis tests ever run. You are essentially saying that you want a <5% chance of falsely rejecting any null hypothesis, ever. This sets up an impossibly stringent threshold, and inference would be effectively useless, but there's an even more pressing issue: your global correction means you are testing absolutely nonsensical "compound hypotheses" like

$$\begin{align} H_1: &\textrm{Drug XYZ changes T-cell count } \wedge \\ &\textrm{Grapes grow better in some fields } \wedge \\ &\ldots \wedge \ldots \wedge \ldots \wedge \ldots \wedge \\ &\textrm{Men and women eat different amounts of ice cream} \end{align} $$

With False Discovery Rate corrections, the numerical issue isn't quite so severe, but it is still a mess philosophically. Instead, it makes sense to define a "family" of related tests, like a list of candidate genes during a genomics study, or a set of time-frequency bins during a spectral analysis. Tailoring your family to a specific question lets you actually interpret your Type I error bound in a direct way. For example, you could look at a FWER-corrected set of p-values from your own genomic data and say "There's a <5% chance that any of these genes are false positives." This is a lot better than a nebulous guarantee that covers inferences done by people you don't care about on topics you don't care about.

The flip side of this is that the appropriate choice of "family" is debatable and a bit subjective (Are all genes one family, or can I just consider the kinases?), but it should be informed by your problem, and I don't believe anyone has seriously advocated defining families nearly so extensively.

How about Bayes?

Bayesian analysis offers a coherent alternative to this problem--if you're willing to move a bit away from the Frequentist Type I/Type II error framework. We start with some non-committal prior over...well...everything. Every time we learn something, that information is combined with the prior to generate a posterior distribution, which in turn becomes the prior for the next time we learn something. This gives you a coherent update rule, and you could compare different hypotheses about specific things by calculating the Bayes factor between two hypotheses. You could presumably factor out large chunks of the model, which would keep this from being particularly onerous.

There is a persistent...meme that Bayesian methods don't require multiple comparisons corrections. Unfortunately, the posterior odds are just another test statistic for frequentists (i.e., people who care about Type I/II errors). They don't have any special properties that control these types of errors (why would they?). Thus, you're back in intractable territory, but perhaps on slightly more principled ground.

The Bayesian counter-argument is that we should focus on what we can know now and thus these error rates aren't as important.

On Reproducibility

You seem to be suggesting that improper multiple comparisons-correction is the reason behind a lot of incorrect/unreproducible results. My sense is that other factors are more likely to be an issue. An obvious one is that pressure to publish leads people to avoid experiments that really stress their hypothesis (i.e., bad experimental design).

For example, in this experiment (part of Amgen's (ir)reproducibility initiative), it turns out that the mice had mutations in genes other than the gene of interest. Andrew Gelman also likes to talk about the Garden of Forking Paths, wherein researchers choose a (reasonable) analysis plan based on the data, but might have done other analyses if the data looked different. This inflates the false positive rate in a similar way to multiple comparisons, but is much harder to correct for afterward. Blatantly incorrect analysis may also play a role, but my feeling (and hope) is that that is gradually improving.

Solved – Correcting for multiple comparisons after multiple ANOVAs

Not to worry too much. The first question is whether the results are being interpreted serially, i.e., is this, or that, or the next thing significant? If, for example, we have 6 tests and they are all significant, then we are not saying that any single one of them determines that the whole series of tests is significant; that is just not the situation here.

So what is Bonferroni? Bland and Altman explain it thus: "If we test a null hypothesis which is in fact true, using 0.05 as the critical significance level, we have a probability of 0.95 of coming to a not significant—that is, correct—conclusion. If we test two independent true null hypotheses, the probability that neither test will be significant is 0.95x0.95=0.90. If we test 20 such hypotheses the probability that none will be significant is $0.95^{20}=0.36$. This gives a probability of 1–0.36=0.64 of getting at least one significant result—we are more likely to get one than not. The expected number of spurious significant results is 20x0.05=1. In general, if we have ($\kappa$) independent significance tests at the ($\alpha$) level of null hypotheses which are all true, the probability that we will get no significant differences is $(1-\alpha)^{\kappa}$. If we make ($\alpha$) small enough we can make the probability that none of the separate tests is significant equal to 0.95. Then if any of the ($\kappa$) tests has a $P$-value less than ($\alpha$) we will have a significant difference between the treatments at the 0.05 level. Since $\alpha$ will be very small, it can be shown that $(1-\alpha)^{\kappa} \approx 1-\kappa \alpha$. If we put $\kappa \alpha=0.05$, so $\alpha=\frac{0.05}{\kappa}$, we will have probability 0.05 that one of the $\kappa$ tests will have a $P$ value less than $\alpha$ if the null hypotheses are true. Thus, if in a clinical trial we compare two treatments within five subsets of patients the treatments will be significantly different at the 0.05 level if there is a P value less than 0.01 within any of the subsets. This is the Bonferroni method. Note that they are not significant at the 0.01 level, but at only the 0.05 level."
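The arithmetic in the quoted passage can be checked directly. The sketch below compares the Bonferroni per-test level $0.05/\kappa$ with the exact level obtained by solving $(1-\alpha)^{\kappa} = 0.95$ for independent tests (usually called the Sidak level). The function names are illustrative only.

```python
def bonferroni_threshold(k, family_alpha=0.05):
    """Per-test level from the approximation (1 - a)**k ~ 1 - k*a."""
    return family_alpha / k

def sidak_threshold(k, family_alpha=0.05):
    """Exact per-test level solving (1 - a)**k = 1 - family_alpha (independent tests)."""
    return 1 - (1 - family_alpha) ** (1 / k)

# Chance of at least one significant result among 20 independent true nulls at 0.05.
print(round(1 - 0.95 ** 20, 2))

for k in (2, 5, 20):
    print(f"k = {k:>2}: Bonferroni {bonferroni_threshold(k):.5f}, "
          f"exact {sidak_threshold(k):.5f}")
```

The Bonferroni level is always slightly below the exact one, so it is mildly conservative for independent tests, and the gap is small at the usual 0.05 family level.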

On the other hand, if your situation is a MANOVA one, as @DavidLane suggests, then you should use that first, as it lumps all the data into a single test of significance and is more specific to the lumped significance than a Bonferroni correction of multiple serial tests. As for your question, "to tease apart interactions in the various ANOVAs I have t-tests... do they come into the overall adjustment?": that could be done after the MANOVA, which first shows whether the ensemble is significant.