You are close, with your use of dhyper
and phyper
, but I don't understand where 0:2
and -1:2
are coming from.
The p-value you want is the probability of getting 100 or more white balls in a sample of size 400 from an urn with 3000 white balls and 12000 black balls. Here are four ways to calculate it.
sum(dhyper(100:400, 3000, 12000, 400))
1 - sum(dhyper(0:99, 3000, 12000, 400))
phyper(99, 3000, 12000, 400, lower.tail=FALSE)
1-phyper(99, 3000, 12000, 400)
These give 0.0078.
dhyper(x, m, n, k)
gives the probability of drawing exactly x
. In the first line, we sum up the probabilities for 100 – 400; in the second line, we take 1 minus the sum of the probabilities of 0 – 99.
phyper(x, m, n, k)
gives the probability of getting x
or fewer, so phyper(x, m, n, k)
is the same as sum(dhyper(0:x, m, n, k))
.
The lower.tail=FALSE
is a bit confusing. phyper(x, m, n, k, lower.tail=FALSE)
is the same as 1-phyper(x, m, n, k)
, and so is the probability of x+1
or more. [I never remember this and so always have to double check.]
At that stattrek.com site, you want to look at the last row, "Cumulative Probability: P(X $\ge$ 100)," rather than the first row "Hypergeometric Probability: P(X = 100)."
Any particular number that you draw is going to have small probability (in fact, max(dhyper(0:400, 3000, 12000, 400))
gives $\sim$0.050), and getting 101 or 102 or any larger number is even more interesting that 100, and the p-value is the probability, if the null hypothesis were true, of getting a result as interesting or more so than what was observed.
Here's a picture of the hypergeometric distribution in this case. You can see that it's centered at 80 (20% of 400) and that 100 is pretty far out in the right tail.
ENCODE Caltech dataset contains two replicates for K562 cell line. Is that the experiment you are using? And how did you select the gene lists that you currently have at hand? Further, do you want to test all the genes in the list as a "set", or do you want to have a single p-value for each gene?
It is very common to select the gene lists on the basis of some statistical test. For example, R add-on packages edgeR, DESeq and limma (from the Bioconductor project) offer suitable methods for rna-seq data, and especially for small sample sizes. There are also other ways to do this, such as, simple filtering based on FPKM values or their standard deviation. Using some statistical test to arrive to list of differentially expressed genes will also simultaneously give you a p-value for each gene, which might be what you are after. Now that you have a single condition only (just one cell line, K562), the statistical test is equivalent to testing whether the gene's expression is different from 1, or 0 if the FPKM values are also log-transformed.
Each gene's statistical significance can also be tested by using a permutation test, where usually the sample labels are shuffled a large number of times. This approach is not really applicable here, since there are only two samples. For more information on a possible implementation, see this page.
In addition, it is also possible to test the whole genelist as a set, and possibly compare it to all the other genes in the experiment. This can be accomplished with, e.g., the package globaltest (from the Bioconductor project) in R.
Best Answer
Well, your function doesn't entirely make sense as written, depending as it does on an undefined global variable
ah
.Assuming that
M
is a matrix of counts, the edgeR User's Guide advises you to use:if your aim is to get normalized quantities for plotting etc.
The User's Guide advises you not to use
equalizeLibSizes
.