I'm trying out some metrics to filter a list of gene names, each with an associated numerical value
corresponding to its abundance.
I'm looking for some genes of interest and checking whether they appear near the top of the gene list (ranked by abundance)
before and after filtering. Please see the enclosed figure.
I want to see whether the rank of my gene of interest (G) changes significantly between the unfiltered and filtered lists, and I'm thinking
of using Fisher's exact test for proportions to assess the significance.
Given the rank of the gene of interest, I would count the genes ranked above and below it,
with and without filtering, and compare the proportions using Fisher's exact test.
If in the unfiltered data there are 200 genes above and 2800 genes below G, and in the filtered data 30 above and 100 below,
then I can run the test as follows (in R):
> my.mat <- matrix(c(30, 100, 200, 2800), nrow = 2, byrow = TRUE)
> my.mat
     [,1] [,2]
[1,]   30  100
[2,]  200 2800
> fisher.test(my.mat, alternative = "two.sided")
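As a sketch of how the four counts could be obtained from the ranked lists: the helper `rank_counts` and the toy vectors below are hypothetical, for illustration only, and assume that higher abundance means a better rank.

```r
# Hypothetical helper: count the genes ranked above and below a gene of interest.
# `abundance` is a named numeric vector (names = gene IDs).
rank_counts <- function(abundance, gene) {
  stopifnot(gene %in% names(abundance))
  ranks <- rank(-abundance, ties.method = "first")  # rank 1 = most abundant
  r <- ranks[[gene]]
  c(above = r - 1, below = length(abundance) - r)
}

# Toy data, for illustration only (not the real data set)
unfiltered <- c(A = 50, B = 40, G = 30, C = 20, D = 10, E = 5)
filtered   <- unfiltered[c("A", "G", "C", "D")]

# Same layout as the matrix above: row 1 = filtered, row 2 = unfiltered
my.mat <- rbind(filtered   = rank_counts(filtered, "G"),
                unfiltered = rank_counts(unfiltered, "G"))
my.mat
```

With the real data, the two abundance vectors would replace the toy ones and the resulting matrix could be passed to the test directly.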
Please let me know whether Fisher's exact test can be used at all for this kind of measurement.
Thanks in advance.
Best Answer
Your problem has two main drawbacks that argue against using Fisher's exact test.
Although in practice the difference should be small when a clearly significant result is present, such issues might make Fisher's exact test unstable. What you could use instead is the hypergeometric test or McNemar's test for 2x2 contingency tables. A web search will turn up many thorough articles describing both. In gene set enrichment analysis, the hypergeometric test is mainly applied to Gene Ontology terms; however, the setting described here is not much different.
To apply the hypergeometric test, assume that the unfiltered set is the "reference" set and the filtered set is the "significant" set, and classify as "successes" the genes ranked above G. Let n and x be the numbers of genes below and above G in the unfiltered list, respectively, and t and z the corresponding numbers in the filtered list. Then the probability of seeing at least z genes above G in the filtered list by chance (in the Gene Ontology analysis literature you will find this as "over-representation") is given by:
$$P(Z \ge z) = 1-P_{hypergeometric}(Z<z)= 1- \sum_{i=0}^{z-1}\frac{{x \choose i}{n \choose z+t-i}}{{n+x \choose z+t}} $$
In R, this can be computed with phyper; the result is similar (in terms of significance) to the value you get from your code but, in my opinion, more accurate.
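A minimal sketch of this calculation in R, using the counts from the question. It assumes that the filtered list is treated as a draw of z + t genes, without replacement, from the x + n genes of the unfiltered list; that reading of the "reference" and "significant" sets is an assumption, not something the question confirms.

```r
# Counts from the question
x <- 200   # genes above G in the unfiltered (reference) list
n <- 2800  # genes below G in the unfiltered list
z <- 30    # genes above G in the filtered list
t <- 100   # genes below G in the filtered list

# phyper(q, m, n, k): m = successes in the population, n = failures, k = draws.
# lower.tail = FALSE gives P(Z > q), so q = z - 1 yields P(Z >= z).
p_over <- phyper(z - 1, x, n, z + t, lower.tail = FALSE)
p_over
```

Using `lower.tail = FALSE` rather than `1 - phyper(...)` avoids losing precision when the tail probability is very small.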
Regarding McNemar's test, you can also try it in R; it is already in the stats package, as mcnemar.test.
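A minimal sketch of the call, assuming the same 2x2 table as in the question is what gets passed in. McNemar's test is designed for paired counts, so whether this table qualifies is an assumption that should be checked against how the data were collected.

```r
# 2x2 table from the question (rows: filtered, unfiltered; columns: above, below)
my.mat <- matrix(c(30, 100, 200, 2800), nrow = 2, byrow = TRUE)

# McNemar's chi-squared test; the continuity correction is on by default
res <- mcnemar.test(my.mat)
res$statistic
res$p.value
```

The test uses only the off-diagonal cells (100 and 200 here), which is the main way it differs from Fisher's exact test on the same table.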