Solved – Fisher’s exact test for gene rank enrichment

Tags: fishers-exact-test, genetics, ranking

I'm trying some metrics to filter a list of gene names with associated numerical values
which correspond to their abundance.

I'm looking for some interesting genes and trying to see if they come up to the top region of the ranked gene list (ranked by abundance)
before and after filtering. Please see Fig. enclosed.
[Figure: ranked gene list before and after filtering]

I want to see whether the rank of my gene of interest (G) changes significantly after filtering, and I am thinking
of using Fisher's exact test for proportions to assess the significance.

Given the rank of the gene of interest, I would count the genes above and below it in rank,
with and without filtering, and compare the proportions using Fisher's exact (FE) test.

If the unfiltered data have 200 genes above G and 2800 below, and the filtered data have 30 above and 100 below,
then I can run the FE test as follows (in R):

>my.mat <- matrix(c(30,100,200,2800),nrow=2,byrow=TRUE)
>my.mat
     [,1] [,2]
[1,]   30  100
[2,]  200 2800

>fisher.test(my.mat,alternative="two.sided")

Please let me know whether Fisher's exact test is appropriate AT ALL for this kind of measurement.

Thanks in advance

Best Answer

Your problem has two main drawbacks that argue against using Fisher's exact test:

1. The dependence between the two cases you want to compare (the same genes, pre- and post-filtering).
2. The counts in the contingency table are quite unbalanced (a large number of genes is excluded by the filtering).

In practice the conclusion should not differ much if a significant effect is present, but these issues can make Fisher's exact test unstable. You could instead use the hypergeometric test or McNemar's test for 2x2 contingency tables; both are well documented and easy to look up. In gene set enrichment analysis the hypergeometric test is mostly applied to Gene Ontology, but the setting described here is not much different.

To apply the hypergeometric test, treat the unfiltered set as the "reference" set and the filtered set as the "significant" set, and classify the genes ranked above G as "successes". Let n and x be the numbers of genes below and above the rank in the unfiltered list, respectively, and t and z the same numbers in the filtered list. Then the probability of seeing at least z genes above the rank in the filtered list by chance (called "over-representation" in the Gene Ontology analysis literature) is given by:

$$1-P_{hypergeometric}(Z<z)= 1- \sum_{i=0}^{z-1}\frac{{x \choose i}{n \choose z+t-i}}{{x+n \choose z+t}} $$

In R:

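# phyper(q, ...) returns P(Z <= q), so P(Z >= 30) = 1 - P(Z < 30) needs q = 30 - 1
p <- phyper(30 - 1, 200, 2800, 30 + 100, lower.tail = FALSE)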

which is similar (in terms of significance) to the value your code gives but, in my opinion, more accurate.
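As a rough check of the tail computation, the sketch below recomputes the upper tail P(Z >= 30) in two ways, using the counts from the question. It also runs a one-sided Fisher test on the same 2x2 table for comparison; that test conditions on the margins of the combined table (3130 genes) rather than treating the unfiltered list as the reference population, so the two p-values need not coincide. The variable names are illustrative only.

```r
# Counts taken from the question.
above_unf <- 200   # genes ranked above G, unfiltered list
below_unf <- 2800  # genes ranked below G, unfiltered list
above_fil <- 30    # genes ranked above G, filtered list
below_fil <- 100   # genes ranked below G, filtered list
k <- above_fil + below_fil  # size of the filtered list

# phyper() returns P(Z <= q), so the upper tail P(Z >= z) needs q = z - 1.
p_tail <- phyper(above_fil - 1, above_unf, below_unf, k, lower.tail = FALSE)

# The same tail as an explicit sum of point probabilities.
p_sum <- sum(dhyper(above_fil:k, above_unf, below_unf, k))

# One-sided Fisher test on the 2x2 table from the question.
my.mat <- matrix(c(30, 100, 200, 2800), nrow = 2, byrow = TRUE)
p_fisher <- fisher.test(my.mat, alternative = "greater")$p.value

# Hypergeometric tail built from the margins of that table.
p_margins <- phyper(my.mat[1, 1] - 1, sum(my.mat[, 1]), sum(my.mat[, 2]),
                    sum(my.mat[1, ]), lower.tail = FALSE)

print(c(tail = p_tail, sum = p_sum, fisher = p_fisher, margins = p_margins))
```

The last two values agree because a one-sided Fisher test is itself a hypergeometric tail probability; the difference from the first two comes only from which totals are used as the reference population.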

Regarding the McNemar test, you can also try it in R (it is already in the stats package):

mcnemar.test(my.mat)

Related Solutions

Solved – Fisher’s exact test gives non-uniform p-values

The problem is the data are discrete so histograms can be deceiving. I coded a simulation with qqplots that show an approximate uniform distribution.

library(lattice)
set.seed(5545)
TotalNo=300
TotalYes=450

pvalueChi=rep(NA,10000)
pvalueFish=rep(NA,10000)

for(i in 1:10000){
  MaleAndNo=rbinom(1,TotalNo,.3)
  FemaleAndNo=TotalNo-MaleAndNo
  MaleAndYes=rbinom(1,TotalYes,.3)
  FemaleAndYes=TotalYes-MaleAndYes
  x=matrix(c(MaleAndNo,FemaleAndNo,MaleAndYes,FemaleAndYes),nrow=2,ncol=2)
  pvalueChi[i]=chisq.test(x)$p.value
  pvalueFish[i]=fisher.test(x)$p.value
}

dat=data.frame(pvalue=c(pvalueChi,pvalueFish),type=rep(c('Chi-Squared','Fishers'),each=10000))
histogram(~pvalue|type,data=dat,breaks=10)
qqmath(~pvalue|type,data=dat,distribution=qunif,
       panel = function(x, ...) {
         panel.qqmathline(x, ...)
         panel.qqmath(x, ...)
       })


Solved – Comparing p-values for Fisher’s exact test and test of equal proportions

prop.test uses a Pearson chi-square test. This is an asymptotic test, so it performs worst when you have small samples or get too near the tails. Fisher's will always be "better" because it is an "exact" test that does not rely on asymptotic arguments to obtain its p-values. Instead, it computes all the ways the table could have come about and then finds the proportion that were as or more extreme.

Practically, this will result in Fisher's being less "powerful" when it matters because Pearson's approximation is most wrong in exactly those cases.
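The "as or more extreme" computation described above can be sketched by hand for a small table. The counts below are hypothetical and chosen only for illustration; the sketch assumes the two-sided p-value is the total probability of all tables, with the same margins, that are no more likely than the observed one.

```r
# Hypothetical 2x2 table, chosen only for illustration.
tab <- matrix(c(3, 1, 1, 3), nrow = 2)

m <- sum(tab[, 1])  # first-column total
n <- sum(tab[, 2])  # second-column total
k <- sum(tab[1, ])  # first-row total
x <- tab[1, 1]      # observed top-left cell

# With the margins fixed, the top-left cell can only take these values.
support <- max(0, k - n):min(k, m)
probs <- dhyper(support, m, n, k)

# Two-sided p-value: sum the probabilities of all tables that are no more
# likely than the observed one (small tolerance guards against float ties).
p_obs <- dhyper(x, m, n, k)
p_manual <- sum(probs[probs <= p_obs * (1 + 1e-7)])

# Built-in result for comparison.
p_builtin <- fisher.test(tab)$p.value
print(c(manual = p_manual, builtin = p_builtin))
```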

I do not know why fisher.test should take so long. For sample sizes on the order of $10^7$, it should have dropped to approximate methods unless the events are really rare. Are they? An alternative might be binom.test, which is also exact (it runs an exact binomial test rather than Fisher's) and may swap algorithms when sample sizes get large and event rates are still common. That might speed things up. A Monte Carlo version might also work.

In your case and for sample sizes this high and non-rare events, Fisher's and Pearson's should not disagree to any real extent but I'd request the continuity-correction on Pearson prop.test(..., correct=TRUE). Try your simulation with this option and see if there is a dime's worth of difference then.
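A minimal sketch of that comparison, using made-up counts (the question's actual data are not shown here), could look like this. It runs prop.test with and without the continuity correction and sets the results next to fisher.test on the same 2x2 table.

```r
# Hypothetical counts: events and trials in two groups.
events <- c(3100, 3000)
trials <- c(10000, 10000)

# Pearson chi-square test of equal proportions, with and without
# the continuity correction.
p_corrected   <- prop.test(events, trials, correct = TRUE)$p.value
p_uncorrected <- prop.test(events, trials, correct = FALSE)$p.value

# Fisher's exact test on the same data as a 2x2 table
# (rows: events / non-events, columns: groups).
tab <- rbind(events, trials - events)
p_fisher <- fisher.test(tab)$p.value

print(c(corrected = p_corrected, uncorrected = p_uncorrected, fisher = p_fisher))
```

With samples this large and events this common, the three p-values would be expected to sit close together, with the corrected Pearson value nearer to Fisher's.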

Another option is Barnard's unconditional test which can be more powerful but which many people frown at (even Barnard) though their cited reasons are often esoteric. In any case, that is not likely to be faster than either Pearson or Fisher.
