Solved – Calculating the probability of gene list overlap between an RNA-Seq and a ChIP-chip data set

bioinformatics, biostatistics, genetics, microarray, r

Hopefully someone on these forums can help me out with this basic problem in gene expression studies.

I did deep sequencing of an experimental and a control tissue. I then obtained fold enrichment values of genes in the experimental sample over control. The reference genome has ~15,000 genes. 3,000 out of 15,000 genes are enriched above a certain cut-off in my sample of interest compared to control.

So:
A = total gene population = 15,000
B = RNA-Seq enriched subpopulation = 3,000.

In a previous ChIP-chip experiment, I found 400 genes that are enriched by ChIP-chip. Of the 400 ChIP-chip genes, 100 genes are in the group of 3,000 enriched RNA-Seq transcripts.

So:
C = total # of ChIP-chip enriched genes = 400.

What is the probability that my 100 ChIP-chip genes would be enriched by RNA-Seq by chance alone? In other words, what is the most prudent way to calculate if my observed overlap between B and C (100 genes) is any better than that obtained by chance alone? From what I have read so far, the best way to test this is by using hypergeometric distribution.

I used an online calculator (stattrek.com) to set up a hypergeometric distribution test with the following parameters:
– pop size = 15,000
– # of successes in population = 3,000
– sample size = 400
– # of successes in sample = 100
I get the following for Hypergeometric Probability: P(X = 100) = 0.00224050636447747

The actual # of genes overlapping between B and C = 100. Is this better than by chance alone? Doesn't look like it is if the chance of any one gene being enriched is 1:5 (3,000 out of 15,000). That's why I don't understand how come my P(x=100) I calculated above is 0.0022. That amounts to a 0.2% chance of the overlap occurring by chance. Shouldn't this be much higher?

If I sampled 400 random genes from the big list of 15,000, then about 80 of these genes would be expected to be enriched by chance alone (1:5). The number of genes that are actually overlapping is 100, so this is just slightly better than by chance.

I also tried to come up with a solution using the dhyper or phyper functions in R (using what I saw in another post):
A=all genes in the genome (15,000)
B=RNA-Seq enriched genes (3,000)
C=ChIP-chip enriched genes (400)
Here's the R input/output (adapted from a previous stackexchange post):

> totalpop <- 15000    
> sample1 <- 3000    
> sample2 <- 400    
> dhyper(0:2, sample1, totalpop-sample1, sample2)    
[1] 4.431784e-40 4.584209e-38 2.364018e-36    
> phyper(-1:2, sample1, totalpop-sample1, sample2)    
[1] 0.000000e+00 4.431784e-40 4.628526e-38 2.410304e-36    

I'm not sure how to interpret these numbers. I believe 2.36e-36 is the probability of getting a complete overlap between B and C by chance alone? But this makes no sense, since that probability is much closer to 1:5. If I start with 15,000 genes, 3,000 will be enriched. Similarly, if I start with 400 ChIP-chip genes, 80 of them should be enriched in the RNA-Seq alone due to the 1:5 chances of enrichment in that data set.

What is the proper way to calculate the p-value, according to the hypergeometric distribution, for the overlap of B and C?

Best Answer

You are close, with your use of dhyper and phyper, but I don't understand where 0:2 and -1:2 are coming from.

The p-value you want is the probability of getting 100 or more white balls in a sample of size 400 from an urn with 3000 white balls and 12000 black balls. Here are four ways to calculate it.

sum(dhyper(100:400, 3000, 12000, 400))
1 - sum(dhyper(0:99, 3000, 12000, 400))
phyper(99, 3000, 12000, 400, lower.tail=FALSE)
1-phyper(99, 3000, 12000, 400)

These give 0.0078.
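
As a quick check that the four expressions above really are the same number, here is a minimal sketch in R. The variable names (overlap, rnaseq_hits, other_genes, chip_hits) are mine, chosen for illustration; only the numbers come from the question.

```r
# Numbers from the question; variable names are illustrative only.
overlap     <- 100    # ChIP-chip genes that are also RNA-Seq enriched
rnaseq_hits <- 3000   # "white balls": RNA-Seq enriched genes
other_genes <- 12000  # "black balls": 15,000 - 3,000
chip_hits   <- 400    # sample size: ChIP-chip enriched genes

# Four ways to get P(X >= overlap)
p1 <- sum(dhyper(overlap:chip_hits, rnaseq_hits, other_genes, chip_hits))
p2 <- 1 - sum(dhyper(0:(overlap - 1), rnaseq_hits, other_genes, chip_hits))
p3 <- phyper(overlap - 1, rnaseq_hits, other_genes, chip_hits, lower.tail = FALSE)
p4 <- 1 - phyper(overlap - 1, rnaseq_hits, other_genes, chip_hits)

# TRUE if all four agree up to floating-point tolerance
all_agree <- isTRUE(all.equal(p1, p2)) &&
  isTRUE(all.equal(p1, p3)) &&
  isTRUE(all.equal(p1, p4))
print(c(p1, p2, p3, p4))
print(all_agree)
```

When the p-value is extremely small, the lower.tail = FALSE form is the safer habit, because 1 - phyper(...) loses precision once the lower tail is very close to 1.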

dhyper(x, m, n, k) gives the probability of drawing exactly x. In the first line, we sum up the probabilities for 100 – 400; in the second line, we take 1 minus the sum of the probabilities of 0 – 99.

phyper(x, m, n, k) gives the probability of getting x or fewer, so phyper(x, m, n, k) is the same as sum(dhyper(0:x, m, n, k)).

The lower.tail=FALSE is a bit confusing. phyper(x, m, n, k, lower.tail=FALSE) is the same as 1-phyper(x, m, n, k), and so is the probability of x+1 or more. [I never remember this and so always have to double check.]
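
The off-by-one is easy to see numerically. The sketch below (variable names are mine) compares passing 99 versus 100 to phyper with lower.tail = FALSE; the difference between the two should be exactly the probability of drawing 100.

```r
m <- 3000; n <- 12000; k <- 400   # urn set-up from the question

# P(X >= 100): pass 99, because the upper tail is strictly greater than x
p_ge_100 <- phyper(99, m, n, k, lower.tail = FALSE)

# Passing 100 instead gives P(X >= 101), dropping the observed value itself
p_ge_101 <- phyper(100, m, n, k, lower.tail = FALSE)

# The gap between the two is P(X = 100)
gap <- p_ge_100 - p_ge_101
print(c(p_ge_100, p_ge_101, gap, dhyper(100, m, n, k)))
```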

At that stattrek.com site, you want to look at the last row, "Cumulative Probability: P(X $\ge$ 100)," rather than the first row "Hypergeometric Probability: P(X = 100)."

Any particular number that you draw is going to have small probability (in fact, max(dhyper(0:400, 3000, 12000, 400)) gives $\sim$0.050), and getting 101 or 102 or any larger number is even more interesting than 100. The p-value is the probability, if the null hypothesis were true, of getting a result as interesting or more so than what was observed.
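
To make that concrete, here is a small sketch (variable names are mine) that computes the expected overlap under the null, the single most likely overlap, and how much probability even that most likely value carries.

```r
m <- 3000; n <- 12000; k <- 400   # urn set-up from the question

probs <- dhyper(0:k, m, n, k)     # P(X = 0), ..., P(X = 400)

expected_overlap <- k * m / (m + n)        # mean of the hypergeometric
most_likely      <- which.max(probs) - 1   # -1 because probs starts at x = 0
peak_prob        <- max(probs)             # probability of the most likely value

print(c(expected_overlap, most_likely, peak_prob))
```

Even the most likely single outcome has only a small probability, which is why the probability of exactly 100 on its own says little about how surprising the overlap is.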

A plot of the hypergeometric distribution in this case shows that it's centered at 80 (20% of 400) and that 100 is pretty far out in the right tail.
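
A plot along these lines can be drawn with base R. This is only a sketch, not necessarily how the original figure was made; the display range 40:120, the labels, and the dashed red line marking the observed overlap are my choices.

```r
m <- 3000; n <- 12000; k <- 400   # urn set-up from the question
x <- 40:120                       # range around the mean, chosen for display
probs <- dhyper(x, m, n, k)       # P(X = x) for each x in the range

# Spike plot of the probability mass function
plot(x, probs, type = "h",
     xlab = "Overlap (number of genes)", ylab = "Probability",
     main = "Hypergeometric distribution under the null")
abline(v = 100, col = "red", lty = 2)   # observed overlap
```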
