Hopefully someone on these forums can help me out with this basic problem in gene expression studies.
I did deep sequencing of an experimental and a control tissue. I then obtained fold enrichment values of genes in the experimental sample over control. The reference genome has ~15,000 genes. 3,000 out of 15,000 genes are enriched above a certain cut-off in my sample of interest compared to control.
So:
A = total gene population = 15,000
B = RNA-Seq enriched subpopulation = 3,000.
In a previous ChIP-chip experiment, I found 400 genes that are enriched by ChIP-chip. Of the 400 ChIP-chip genes, 100 genes are in the group of 3,000 enriched RNA-Seq transcripts.
So:
C = total # of ChIP-chip enriched genes = 400.
What is the probability that my 100 ChIP-chip genes would be enriched by RNA-Seq by chance alone? In other words, what is the most prudent way to calculate whether my observed overlap between B and C (100 genes) is any better than that obtained by chance alone? From what I have read so far, the best way to test this is with the hypergeometric distribution.
I used an online calculator (stattrek.com) to set up a hypergeometric distribution test with the following parameters:
– pop size=15,000
– # of successes in population=3,000
– sample size=400
– # of successes in sample=100
I get the following for Hypergeometric Probability P(x=100)= 0.00224050636447747
The actual # of genes overlapping between B and C = 100. Is this better than by chance alone? Doesn't look like it is if the chance of any one gene being enriched is 1:5 (3,000 out of 15,000). That's why I don't understand how come my P(x=100) I calculated above is 0.0022. That amounts to a 0.2% chance of the overlap occurring by chance. Shouldn't this be much higher?
If I sampled 400 random genes from the big list of 15,000, then about 80 of these genes would be expected to be enriched by chance alone (1:5). The number of genes that are actually overlapping is 100, so this is just slightly better than by chance.
I also tried to come up with a solution using the dhyper or phyper functions in R (using what I saw in another post):
A=all genes in the genome (15,000)
B=RNA-Seq enriched genes (3,000)
C=ChIP-chip enriched genes (400)
Here's the R input/output (adapted from a previous stackexchange post):
> totalpop <- 15000
> sample1 <- 3000
> sample2 <- 400
> dhyper(0:2, sample1, totalpop-sample1, sample2)
[1] 4.431784e-40 4.584209e-38 2.364018e-36
> phyper(-1:2, sample1, totalpop-sample1, sample2)
[1] 0.000000e+00 4.431784e-40 4.628526e-38 2.410304e-36
I'm not sure how to interpret these numbers. I believe 2.36e-36 is the probability of getting a complete overlap between B and C by chance alone? But this makes no sense, since that probability is much closer to 1:5. If I start with 15,000 genes, 3,000 will be enriched. Similarly, if I start with 400 ChIP-chip genes, 80 of them should be enriched in the RNA-Seq alone due to the 1:5 chances of enrichment in that data set.
What is the proper way to calculate the p-value, according to the hypergeometric distribution, for the overlap of B and C?
Best Answer
You are close with your use of `dhyper` and `phyper`, but I don't understand where `0:2` and `-1:2` are coming from.

The p-value you want is the probability of getting 100 or more white balls in a sample of size 400 from an urn with 3000 white balls and 12000 black balls. Here are four ways to calculate it.
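A sketch of four such calculations, reconstructed from the description that follows. The urn parameters are the ones stated above; the variable names and the exact form of each line are assumptions.

```r
# Urn from the question: 3000 white balls (RNA-Seq enriched genes),
# 12000 black balls (not enriched), 400 draws (ChIP-chip genes),
# 100 white balls observed in the sample.
m <- 3000; n <- 12000; k <- 400; x <- 100

p1 <- sum(dhyper(x:k, m, n, k))                   # sum of P(X = 100), ..., P(X = 400)
p2 <- 1 - sum(dhyper(0:(x - 1), m, n, k))         # 1 minus sum of P(X = 0), ..., P(X = 99)
p3 <- phyper(x - 1, m, n, k, lower.tail = FALSE)  # P(X > 99), i.e. P(X >= 100)
p4 <- 1 - phyper(x - 1, m, n, k)                  # 1 minus P(X <= 99)

print(c(p1, p2, p3, p4))
```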
These give 0.0078.

`dhyper(x, m, n, k)` gives the probability of drawing exactly `x`. In the first line, we sum up the probabilities for 100 – 400; in the second line, we take 1 minus the sum of the probabilities of 0 – 99.

`phyper(x, m, n, k)` gives the probability of getting `x` or fewer, so `phyper(x, m, n, k)` is the same as `sum(dhyper(0:x, m, n, k))`.

The `lower.tail=FALSE` is a bit confusing. `phyper(x, m, n, k, lower.tail=FALSE)` is the same as `1-phyper(x, m, n, k)`, and so is the probability of `x+1` or more. [I never remember this and so always have to double check.]

At that stattrek.com site, you want to look at the last row, "Cumulative Probability: P(X $\ge$ 100)," rather than the first row "Hypergeometric Probability: P(X = 100)."
Any particular number that you draw is going to have small probability (in fact, `max(dhyper(0:400, 3000, 12000, 400))` gives $\sim$0.050), and getting 101 or 102 or any larger number is even more interesting than 100. The p-value is the probability, if the null hypothesis were true, of getting a result as interesting or more so than what was observed.

Here's a picture of the hypergeometric distribution in this case. You can see that it's centered at 80 (20% of 400) and that 100 is pretty far out in the right tail.
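A plot along these lines can be drawn with base R. This is a minimal sketch; the plotting window, axis labels and the dashed marker are my own choices, not taken from the original figure.

```r
m <- 3000; n <- 12000; k <- 400
x <- 0:k
probs <- dhyper(x, m, n, k)   # P(X = x) for every possible overlap

# Show only the region that carries visible mass; 40:130 is an arbitrary window
show <- x >= 40 & x <= 130
plot(x[show], probs[show], type = "h",
     xlab = "Number of ChIP-chip genes also enriched by RNA-Seq",
     ylab = "Probability")
abline(v = 100, lty = 2)      # observed overlap

expected <- k * m / (m + n)   # expected overlap under the null: 400 * 3000 / 15000
print(expected)
```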