Benjamini-Hochberg Correction – Is It More Conservative with More Comparisons?

multiple-comparisons, p-value

How conservative is the Benjamini-Hochberg multiple testing correction relative to the total number of comparisons? For example, I have a list of 18,000 features for two groups, and I perform a Wilcoxon test on each feature to get a p-value. When I adjust those p-values using Benjamini-Hochberg, next to nothing comes out as significant.

I know that Bonferroni correction can be quite conservative as the number of comparisons increases. Does Benjamini-Hochberg have the same property?

Best Answer

First, you need to understand that these two multiple testing procedures do not control the same thing. Using your example, we have two groups with 18,000 observed variables, and you make 18,000 tests in order to identify some variables which are different from one group to the other.

Bonferroni correction controls the familywise error rate (FWER), that is, the probability of making at least one false positive among the 18,000 tests. In the simplest case, where all 18,000 variables have identical distributions in the two groups, this is the probability that you falsely claim "here I have some significant differences". Usually, you decide that if this probability is < 5%, your claim is credible.
Benjamini-Hochberg correction controls the false discovery rate (FDR), that is, the expected proportion of false positives among the variables for which you claim the existence of a difference. For example, if 20 tests are positive with the FDR controlled at 5%, then "on average" at most 1 of these tests will be a false positive.
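
To make the difference between the two corrections concrete, here is a minimal Python sketch of both adjustments, written from the textbook definitions. The five p-values are made up for illustration, and the function names are hypothetical, not taken from any library.

```python
def bonferroni(pvals):
    """Bonferroni-adjusted p-values: multiply by the number of tests, cap at 1."""
    m = len(pvals)
    return [min(1.0, p * m) for p in pvals]


def benjamini_hochberg(pvals):
    """BH-adjusted p-values: p * m / rank, made monotone from the largest down."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])  # indices, smallest p first
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that an adjusted
    # value never exceeds the adjusted value of a larger p-value.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Invented p-values, for illustration only.
pvals = [0.001, 0.012, 0.02, 0.03, 0.5]

bonf = bonferroni(pvals)
bh = benjamini_hochberg(pvals)

n_bonf = sum(q <= 0.05 for q in bonf)  # 1 test survives Bonferroni
n_bh = sum(q <= 0.05 for q in bh)      # 4 tests survive BH
print(n_bonf, n_bh)
```

With these numbers Bonferroni keeps only the smallest p-value, while BH keeps four, which reflects the weaker criterion (FDR rather than FWER) that BH controls.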

Now, when the number of comparisons increases... well, it depends on the number of marginal null hypotheses that are true. But basically, with both procedures, if you have a few, let's say 5 or 10, truly associated variables, you have a better chance of detecting them among 100 variables than among 1,000,000 variables. That should be intuitive enough. There's no way to avoid this.
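
A rough numerical illustration of this point, under strong simplifying assumptions: 5 truly associated variables are each given the same invented p-value of 1e-5, and the null p-values are placed on an evenly spaced grid in (0, 1) as an idealized stand-in for uniform draws. The helper names are hypothetical.

```python
def count_rejections(pvals, alpha=0.05):
    """Return (Bonferroni rejections, BH rejections) at level alpha."""
    m = len(pvals)
    ranked = sorted(pvals)
    bonferroni = sum(p <= alpha / m for p in ranked)
    # BH step-up rule: reject the k smallest p-values, where k is the
    # largest rank with p_(k) <= k * alpha / m.
    bh = 0
    for k, p in enumerate(ranked, start=1):
        if p <= k * alpha / m:
            bh = k
    return bonferroni, bh


def make_pvalues(m, n_signal=5, signal_p=1e-5):
    """Assumed setup: n_signal small p-values plus evenly spaced null p-values."""
    n_null = m - n_signal
    nulls = [j / (n_null + 1) for j in range(1, n_null + 1)]
    return [signal_p] * n_signal + nulls


results = {}
for m in (100, 10_000, 1_000_000):
    results[m] = count_rejections(make_pvalues(m))
    print(m, results[m])
# 100        -> (5, 5): both procedures find all five signals
# 10_000     -> (0, 5): Bonferroni loses them, BH still finds them
# 1_000_000  -> (0, 0): neither procedure finds anything
```

In this toy setup BH tolerates more comparisons than Bonferroni before the signals disappear, but it does not escape the effect described above.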

Related Solutions

Solved – Benjamini-Hochberg dependency assumptions justified

The BH procedure is valid when the test statistics are independent or positively dependent (PRDS). If you read the 2001 paper by Benjamini and Yekutieli, you will see that multivariate normality is not necessary; they give weaker sufficient conditions in the paper:

Rosenbaum's (1984) conditional (positive) association is enough to imply PRDS: $X$ is conditionally associated if, for any partition $(X_1, X_2)$ of $X$ and any function $h(X_1)$, $X_2$ given $h(X_1)$ is positively associated.

If this seems like a reasonable assumption to make about your data, then just declare it as an assumption and try to come up with scenarios where it is and isn't met, to clarify it to yourself.

Solved – How to apply multiple testing correction for gene list overlap using R

I don't know anything about gene expression studies but I do have some interest in multiple inference so I will risk an answer on this part of the question anyway.

Personally, I would not approach the problem in that way. I would adjust the error level in the original studies, compute the new overlap and leave the test at the end alone. If the number of differentially expressed genes (and any other result you are using) is already based on adjusted tests, I would argue that you don't need to do anything.

If you cannot go back to the original data and really do want to adjust the p-value, you can indeed multiply it by the number of tests but I don't see why it should have anything to do with the size of list2. It would make more sense to adjust for the total number of tests performed in both studies (i.e. two times the population). This is going to be brutal, though.

To adjust p-values in R, you can use p.adjust(p), where p is a vector of p-values.

p.adjust(p, method="bonferroni") # Bonferroni method, simple multiplication
p.adjust(p, method="holm") # Holm-Bonferroni method, more powerful than Bonferroni
p.adjust(p, method="BH") # Benjamini-Hochberg

As stated in the help file, there is no reason not to use Holm-Bonferroni over Bonferroni as it also provides strong control of the familywise error rate in any case but is more powerful. Benjamini-Hochberg controls the false discovery rate, which is a less stringent criterion.
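
The power gain of Holm over Bonferroni can be seen in a small hand-rolled sketch, written in Python here for illustration. It follows the textbook step-down definition and is not meant to reproduce the internals of p.adjust; the p-values are invented and the function names are hypothetical.

```python
def bonferroni(pvals):
    """Bonferroni-adjusted p-values: multiply by the number of tests, cap at 1."""
    m = len(pvals)
    return [min(1.0, p * m) for p in pvals]


def holm(pvals):
    """Holm step-down adjusted p-values."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])  # indices, smallest p first
    adjusted = [0.0] * m
    running_max = 0.0
    for step, i in enumerate(order):
        # The smallest p-value is multiplied by m, the next by m - 1, and so on.
        candidate = min(1.0, (m - step) * pvals[i])
        # Enforce monotonicity: adjusted values never decrease with rank.
        running_max = max(running_max, candidate)
        adjusted[i] = running_max
    return adjusted


# Invented p-values, for illustration only.
pvals = [0.01, 0.015, 0.02, 0.5]

n_bonf = sum(q <= 0.05 for q in bonferroni(pvals))  # 1 rejection
n_holm = sum(q <= 0.05 for q in holm(pvals))        # 3 rejections
print(n_bonf, n_holm)
```

Every Holm-adjusted value is at most the corresponding Bonferroni-adjusted value, so Holm never rejects fewer hypotheses.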

Edited after the comment below:

The more I think about the problem, the more I think that a correction for multiple comparisons is unnecessary and inappropriate in this situation. This is where the notion of a “family” of hypotheses kicks in. Your last test isn't quite comparable to all the earlier tests: there is no risk of “capitalizing on chance” or cherry-picking significant results. There is only one test of interest, and it's legitimate to use the ordinary error level for this one.

Even if you correct aggressively for the many tests performed before, you would still not be directly addressing the main concern, which is the fact that some of the genes in both lists might have been spuriously detected as differentially expressed. The earlier test results still “stand” and if you want to interpret these results while controlling the family-wise error rate, you still need to correct all of them too.

But if the null hypothesis really is true for all genes, any significant result would be a false positive and you would not expect the same gene to be flagged again in the next sample. Overlap between both lists would therefore happen only by chance and this is exactly what the test based on the hypergeometric distribution is testing. So even if the lists of genes are complete junk, the result of that last test is safe. Intuitively, it seems that anything in-between (a mix of true and false hypotheses) should be fine too.
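
For reference, the overlap test based on the hypergeometric distribution can be sketched in a few lines of Python using only the standard library. The function name and the example numbers (a population of 20 genes and two lists of 5 genes sharing 3) are hypothetical, chosen so the arithmetic is easy to check by hand.

```python
from math import comb


def overlap_pvalue(population, size1, size2, overlap):
    """P(X >= overlap) when list2 is drawn at random from the population.

    X is the number of genes in list2 that also belong to list1, so X follows
    a hypergeometric distribution under the hypothesis of chance overlap.
    """
    total = comb(population, size2)
    largest_possible = min(size1, size2)
    tail = sum(
        comb(size1, k) * comb(population - size1, size2 - k)
        for k in range(overlap, largest_possible + 1)
    )
    return tail / total


# Hypothetical numbers: 20 genes in total, two lists of 5, 3 genes shared.
p = overlap_pvalue(population=20, size1=5, size2=5, overlap=3)
print(p)  # 1126 / 15504, roughly 0.073
```

This is a single test, which is why the ordinary error level can be applied to its p-value.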

Maybe someone with more experience in this field might weigh in but I think an adjustment would only become necessary if you want to compare the total number of genes detected or find out which ones are differentially expressed, i.e. if you want to interpret the thousands of individual tests performed in each study.
