Solved – How to apply multiple testing correction for gene list overlap using R

bonferroni, genetics, hypergeometric-distribution, multiple-comparisons, r

I have two studies looking at patient response to the same drug. Study 1 found 10,000 genes expressed above background; 500 of them are differentially expressed and are referred to as the drug response signature. Study 2 found 1,000 genes representing the drug response signature. The overlap between the two signatures is 100 genes.

I want to calculate the statistical significance of the overlap between the signatures. If I understand correctly, one way of doing it (based on the posts here: Calculating the probability of gene list overlap between an RNA seq and a ChIP-chip dataset and here: Using R's phyper to get the probability of list overlap) is via phyper():

> overlap  <- 100    # genes shared by both signatures
> list1    <- 500    # size of the study 1 signature
> totalPop <- 10000  # genes expressed above background (the "universe")
> list2    <- 1000   # size of the study 2 signature
> 
> 1-phyper(overlap-1, list1, totalPop-list1, list2)
[1] 4.103051e-12
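
As a cross-check (if I understand the equivalence correctly), the same one-sided p-value should come out of Fisher's exact test on the corresponding 2x2 contingency table, since the one-sided Fisher test is exactly this hypergeometric tail:

> inBoth  <- overlap                             # in both signatures
> only1   <- list1 - overlap                     # study 1 signature only
> only2   <- list2 - overlap                     # study 2 signature only
> neither <- totalPop - list1 - list2 + overlap  # in neither signature
> fisher.test(matrix(c(inBoth, only1, only2, neither), nrow = 2),
+             alternative = "greater")$p.value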
  1. Does that sound reasonable?

  2. If I wanted to apply the Bonferroni correction, I would need to multiply this p-value by the number of comparisons. What would the number of comparisons correspond to in this case? The size of list2? Alternatively, what would be a quick way of doing a less conservative correction (e.g., Benjamini-Hochberg)?

Best Answer

I don't know anything about gene expression studies, but I do have some interest in multiple inference, so I will risk an answer to this part of the question anyway.

Personally, I would not approach the problem in that way. I would adjust the error level in the original studies, compute the new overlap and leave the test at the end alone. If the number of differentially expressed genes (and any other result you are using) is already based on adjusted tests, I would argue that you don't need to do anything.

If you cannot go back to the original data and really do want to adjust the p-value, you can indeed multiply it by the number of tests, but I don't see why that number should have anything to do with the size of list2. It would make more sense to adjust for the total number of tests performed across both studies (i.e. roughly two times the population size). This is going to be brutal, though.
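
Concretely, a by-hand Bonferroni correction along those lines would look something like this (a sketch only, taking the family to be all ~20,000 tests across both studies and reusing the counts from the question):

p <- 1 - phyper(overlap - 1, list1, totalPop - list1, list2)
n_tests <- 2 * totalPop        # total tests across both studies (an approximation)
min(1, p * n_tests)            # Bonferroni-adjusted p-value, capped at 1

Even this brutal correction leaves your overlap highly significant here (around 8e-8).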

To adjust p-values in R, you can use p.adjust(p), where p is a vector of p-values.

p.adjust(p, method="bonferroni") # Bonferroni method, simple multiplication
p.adjust(p, method="holm") # Holm-Bonferroni method, more powerful than Bonferroni
p.adjust(p, method="BH") # Benjamini-Hochberg

As stated in the help file for p.adjust, there is no reason to use the unmodified Bonferroni correction: Holm-Bonferroni also provides strong control of the family-wise error rate under arbitrary assumptions, but is uniformly more powerful. Benjamini-Hochberg controls the false discovery rate instead, which is a less stringent criterion.
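
To make the difference concrete, here is a small side-by-side comparison on some made-up p-values (illustrative only, not from your studies):

p <- c(0.0001, 0.003, 0.012, 0.04, 0.2)   # illustrative p-values
cbind(raw        = p,
      bonferroni = p.adjust(p, method = "bonferroni"),
      holm       = p.adjust(p, method = "holm"),
      BH         = p.adjust(p, method = "BH"))

Holm's adjusted values are never larger than Bonferroni's, and BH's are smaller still, reflecting the ordering in stringency described above.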


Edited after the comment below:

The more I think about the problem, the more I think that a correction for multiple comparisons is unnecessary and inappropriate in this situation. This is where the notion of a “family” of hypotheses kicks in: your last test isn't really comparable to all the earlier tests. There is no risk of “capitalizing on chance” or cherry-picking significant results; there is only one test of interest, and it is legitimate to use the ordinary error level for it.

Even if you corrected aggressively for the many tests performed before, you would still not be directly addressing the main concern, which is that some of the genes in both lists might have been spuriously flagged as differentially expressed. The earlier test results still “stand”, and if you want to interpret them while controlling the family-wise error rate, you still need to correct all of them too.

But if the null hypothesis really were true for all genes, any significant result would be a false positive, and you would not expect the same gene to be flagged again in an independent sample. Overlap between the two lists would therefore occur only by chance, and this is exactly what the test based on the hypergeometric distribution assesses. So even if the lists of genes are complete junk, the result of that last test is safe. Intuitively, anything in between (a mix of true and false hypotheses) should be fine too.
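
This can be checked with a quick simulation sketch (reusing the counts from the question): draw both “signatures” completely at random from the same gene pool, so that every flagged gene is a false positive, and see how often the overlap test rejects.

set.seed(1)
totalPop <- 10000; list1 <- 500; list2 <- 1000
sim_p <- replicate(2000, {
  g1 <- sample(totalPop, list1)   # study 1 "signature": pure noise
  g2 <- sample(totalPop, list2)   # study 2 "signature": pure noise
  ov <- length(intersect(g1, g2))
  1 - phyper(ov - 1, list1, totalPop - list1, list2)
})
mean(sim_p < 0.05)  # should come out at (or just below) the nominal 0.05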

Maybe someone with more experience in this field will weigh in, but I think an adjustment would only become necessary if you wanted to compare the total number of genes detected, or to find out which specific genes are differentially expressed, i.e. if you wanted to interpret the thousands of individual tests performed in each study.
