Solved – Calculating the probability of gene list overlap between an RNA seq and a ChIP-chip data set

Tags: bioinformatics, biostatistics, genetics, microarray, r

Hopefully someone on these forums can help me out with this basic problem in gene expression studies.

I did deep sequencing of an experimental and a control tissue. I then obtained fold enrichment values of genes in the experimental sample over control. The reference genome has ~15,000 genes. 3,000 out of 15,000 genes are enriched above a certain cut-off in my sample of interest compared to control.

So:
A = total gene population = 15,000
B = RNA-Seq enriched subpopulation = 3,000.

In a previous ChIP-chip experiment, I found 400 genes that are enriched by ChIP-chip. Of the 400 ChIP-chip genes, 100 genes are in the group of 3,000 enriched RNA-Seq transcripts.

So:
C= total # of ChIP-chip enriched genes = 400.

What is the probability that my 100 ChIP-chip genes would be enriched by RNA-Seq by chance alone? In other words, what is the most prudent way to calculate whether my observed overlap between B and C (100 genes) is any better than that obtained by chance alone? From what I have read so far, the best way to test this is with the hypergeometric distribution.

I used an online calculator (stattrek.com) to set up a hypergeometric distribution test with the following parameters:
- pop size = 15,000
- # of successes in population = 3,000
- sample size = 400
- # of successes in sample = 100
I get the following for Hypergeometric Probability P(x=100)= 0.00224050636447747

The actual # of genes overlapping between B and C = 100. Is this better than by chance alone? Doesn't look like it is if the chance of any one gene being enriched is 1:5 (3,000 out of 15,000). That's why I don't understand how come my P(x=100) I calculated above is 0.0022. That amounts to a 0.2% chance of the overlap occurring by chance. Shouldn't this be much higher?

If I sampled 400 random genes from the big list of 15,000, then about 80 of these genes would be expected to be enriched by chance alone (1:5). The number of genes that are actually overlapping is 100, so this is just slightly better than by chance.

I also tried to come up with a solution using the dhyper or phyper functions in R (using what I saw in another post):
A=all genes in the genome (15,000)
B=RNA-Seq enriched genes (3,000)
C=ChIP-chip enriched genes (400)
Here's the R input/output (adapted from a previous stackexchange post):

> totalpop <- 15000    
> sample1 <- 3000    
> sample2 <- 400    
> dhyper(0:2, sample1, totalpop-sample1, sample2)    
[1] 4.431784e-40 4.584209e-38 2.364018e-36    
> phyper(-1:2, sample1, totalpop-sample1, sample2)    
[1] 0.000000e+00 4.431784e-40 4.628526e-38 2.410304e-36

I'm not sure how to interpret these numbers. I believe 2.36e-36 is the probability of getting a complete overlap between B and C by chance alone? But this makes no sense, since that probability is much closer to 1:5. If I start with 15,000 genes, 3,000 will be enriched. Similarly, if I start with 400 ChIP-chip genes, 80 of them should be enriched in the RNA-Seq alone due to the 1:5 chances of enrichment in that data set.

What is the proper way to calculate the p-value, according to the hypergeometric distribution, for the overlap of B and C?

Best Answer

You are close, with your use of dhyper and phyper, but I don't understand where 0:2 and -1:2 are coming from.

The p-value you want is the probability of getting 100 or more white balls in a sample of size 400 from an urn with 3000 white balls and 12000 black balls. Here are four ways to calculate it.

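sum(dhyper(100:400, 3000, 12000, 400))           # P(X = 100) + ... + P(X = 400)
1 - sum(dhyper(0:99, 3000, 12000, 400))          # 1 - [P(X = 0) + ... + P(X = 99)]
phyper(99, 3000, 12000, 400, lower.tail=FALSE)   # P(X > 99), i.e. P(X >= 100)
1-phyper(99, 3000, 12000, 400)                   # 1 - P(X <= 99)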

These give 0.0078.
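If you need this calculation for more than one pair of lists, it can be wrapped in a small function. The sketch below is only an illustration: the name overlap_pvalue and its argument names are made up here, not part of base R. As a sanity check it also compares the result with a one-sided Fisher's exact test on the corresponding 2x2 table, which should give the same tail probability.

```r
# Hypothetical helper (name and arguments are illustrative, not from base R):
# one-sided p-value for seeing `overlap` or more shared genes between two
# lists drawn from the same population of `n_total` genes.
overlap_pvalue <- function(n_total, n_list1, n_list2, overlap) {
  phyper(overlap - 1, n_list1, n_total - n_list1, n_list2, lower.tail = FALSE)
}

p <- overlap_pvalue(n_total = 15000, n_list1 = 3000, n_list2 = 400, overlap = 100)

# Cross-check: the same counts as a 2x2 table (filled column by column)
tab <- matrix(c(100, 300, 2900, 11700), nrow = 2,
              dimnames = list(RNAseq = c("enriched", "not enriched"),
                              ChIP   = c("enriched", "not enriched")))
p_fisher <- fisher.test(tab, alternative = "greater")$p.value

signif(c(hypergeometric = p, fisher = p_fisher), 2)
```

Both numbers should match the 0.0078 quoted above.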

dhyper(x, m, n, k) gives the probability of drawing exactly x. In the first line, we sum up the probabilities for 100 – 400; in the second line, we take 1 minus the sum of the probabilities of 0 – 99.

phyper(x, m, n, k) gives the probability of getting x or fewer, so phyper(x, m, n, k) is the same as sum(dhyper(0:x, m, n, k)).

The lower.tail=FALSE is a bit confusing. phyper(x, m, n, k, lower.tail=FALSE) is the same as 1-phyper(x, m, n, k), and so is the probability of x+1 or more. [I never remember this and so always have to double check.]
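Since the off-by-one is easy to get wrong, here is a quick way to double check it in R. It is only a sketch using the same numbers as above; the variable names x, m, n and k simply follow the argument names of phyper.

```r
x <- 99; m <- 3000; n <- 12000; k <- 400

upper <- phyper(x, m, n, k, lower.tail = FALSE)   # P(X > 99) = P(X >= 100)

# The three formulations agree up to floating-point error
all.equal(upper, 1 - phyper(x, m, n, k))
all.equal(upper, sum(dhyper((x + 1):k, m, n, k)))

# The common mistake: passing 100 gives P(X >= 101), which leaves out P(X = 100)
wrong <- phyper(100, m, n, k, lower.tail = FALSE)
all.equal(upper, wrong + dhyper(100, m, n, k))
```

All three all.equal() calls should return TRUE.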

At that stattrek.com site, you want to look at the last row, "Cumulative Probability: P(X $\ge$ 100)," rather than the first row "Hypergeometric Probability: P(X = 100)."

Any particular number that you draw is going to have small probability (in fact, max(dhyper(0:400, 3000, 12000, 400)) gives $\sim$0.050), and getting 101 or 102 or any larger number is even more interesting than 100. The p-value is the probability, if the null hypothesis were true, of getting a result as interesting as or more interesting than what was observed.

Here's a picture of the hypergeometric distribution in this case. You can see that it's centered at 80 (20% of 400) and that 100 is pretty far out in the right tail. [Figure: probability distribution of the overlap under the null hypothesis]
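A plot along these lines can be drawn with base R graphics. This is a minimal sketch and not necessarily the code behind the original figure; the axis limits, colours and labels are arbitrary choices.

```r
overlap <- 0:400
prob <- dhyper(overlap, 3000, 12000, 400)   # P(X = overlap) for each possible overlap

# Vertical bars for the probability mass function, zoomed in on the bulk
plot(overlap, prob, type = "h", xlim = c(40, 130),
     xlab = "Number of overlapping genes", ylab = "Probability")
abline(v = 100, col = "red", lty = 2)       # the observed overlap

overlap[which.max(prob)]                    # most likely overlap under the null
sum(overlap * prob)                         # expected overlap under the null
```

The last two lines report the mode and the mean of the distribution, which is where the "centered at 80" statement comes from.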

Related Solutions

Solved – How to apply multiple testing correction for gene list overlap using R

I don't know anything about gene expression studies but I do have some interest in multiple inference so I will risk an answer on this part of the question anyway.

Personally, I would not approach the problem in that way. I would adjust the error level in the original studies, compute the new overlap and leave the test at the end alone. If the number of differentially expressed genes (and any other result you are using) is already based on adjusted tests, I would argue that you don't need to do anything.

If you cannot go back to the original data and really do want to adjust the p-value, you can indeed multiply it by the number of tests but I don't see why it should have anything to do with the size of list2. It would make more sense to adjust for the total number of tests performed in both studies (i.e. two times the population). This is going to be brutal, though.

To adjust p-values in R, you can use p.adjust(p), where p is a vector of p-values.

p.adjust(p, method="bonferroni") # Bonferroni method, simple multiplication
p.adjust(p, method="holm") # Holm-Bonferroni method, more powerful than Bonferroni
p.adjust(p, method="BH") # Benjamini-Hochberg

As stated in the help file, there is no reason not to use Holm-Bonferroni over Bonferroni as it also provides strong control of the familywise error rate in any case but is more powerful. Benjamini-Hochberg controls the false discovery rate, which is a less stringent criterion.
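To see how the three methods compare, you can run them side by side on a small vector. The p-values below are made up purely for illustration and do not come from any of the studies discussed here.

```r
# Hypothetical p-values, invented for this example only
p <- c(0.001, 0.008, 0.039, 0.041, 0.20)

# One column of adjusted p-values per method
adjusted <- sapply(c("bonferroni", "holm", "BH"),
                   function(method) p.adjust(p, method = method))
rownames(adjusted) <- paste0("test", seq_along(p))

round(adjusted, 4)
```

Reading across each row, the Holm value is never larger than the Bonferroni value, and the Benjamini-Hochberg value is never larger than the Holm value, which reflects the ordering from most to least stringent.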

Edited after the comment below:

The more I think about the problem, the more I think that a correction for multiple comparisons is unnecessary and inappropriate in this situation. This is where the notion of a “family” of hypotheses kicks in. Your last test isn't quite comparable to all the earlier tests: there is no risk of “capitalizing on chance” or cherry-picking significant results, there is only one test of interest, and it's legitimate to use the ordinary error level for this one.

Even if you correct aggressively for the many tests performed before, you would still not be directly addressing the main concern, which is the fact that some of the genes in both lists might have been spuriously detected as differentially expressed. The earlier test results still “stand” and if you want to interpret these results while controlling the family-wise error rate, you still need to correct all of them too.

But if the null hypothesis really is true for all genes, any significant result would be a false positive and you would not expect the same gene to be flagged again in the next sample. Overlap between both lists would therefore happen only by chance and this is exactly what the test based on the hypergeometric distribution is testing. So even if the lists of genes are complete junk, the result of that last test is safe. Intuitively, it seems that anything in-between (a mix of true and false hypotheses) should be fine too.
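The claim that chance overlap follows the hypergeometric distribution can be checked with a small simulation. The sketch below assumes the numbers from the question above (15,000 genes, lists of 3,000 and 400, overlap of 100); the seed and the number of replicates are arbitrary choices.

```r
set.seed(1)
n_total <- 15000
n_rna   <- 3000    # under the null, which genes are "enriched" is arbitrary,
                   # so label genes 1..3000 as the RNA-Seq list
n_chip  <- 400

# Draw a random ChIP list many times and count how many genes land in the RNA-Seq list
sim_overlap <- replicate(20000, sum(sample.int(n_total, n_chip) <= n_rna))

sim_mean <- mean(sim_overlap)           # should be close to 80
sim_p    <- mean(sim_overlap >= 100)    # Monte Carlo estimate of P(X >= 100)
exact_p  <- phyper(99, n_rna, n_total - n_rna, n_chip, lower.tail = FALSE)

c(simulated_mean = sim_mean, simulated_p = sim_p, exact_p = exact_p)
```

The simulated tail probability should land close to the exact hypergeometric value, up to Monte Carlo error.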

Maybe someone with more experience in this field might weigh in but I think an adjustment would only become necessary if you want to compare the total number of genes detected or find out which ones are differentially expressed, i.e. if you want to interpret the thousands of individual tests performed in each study.

Solved – Bootstrapping of RNA-Seq data: normal distribution

The ENCODE Caltech dataset contains two replicates for the K562 cell line. Is that the experiment you are using? And how did you select the gene lists that you currently have at hand? Further, do you want to test all the genes in the list as a "set", or do you want to have a single p-value for each gene?

It is very common to select the gene lists on the basis of some statistical test. For example, the R add-on packages edgeR, DESeq and limma (from the Bioconductor project) offer suitable methods for RNA-Seq data, especially for small sample sizes. There are also other ways to do this, such as simple filtering based on FPKM values or their standard deviation. Using a statistical test to arrive at a list of differentially expressed genes will also give you a p-value for each gene, which might be what you are after. Since you have a single condition only (just one cell line, K562), the statistical test is equivalent to testing whether the gene's expression is different from 1, or from 0 if the FPKM values are also log-transformed.

Each gene's statistical significance can also be tested by using a permutation test, where usually the sample labels are shuffled a large number of times. This approach is not really applicable here, since there are only two samples. For more information on a possible implementation, see this page.

In addition, it is also possible to test the whole gene list as a set, and possibly compare it to all the other genes in the experiment. This can be accomplished with, e.g., the package globaltest (from the Bioconductor project) in R.
