I'm comparing mutation rates in genes between two DNA sequencing datasets, which I'm doing with a two-tailed Fisher's exact test (please correct me if this is the wrong test for the situation!). I've run the test in R using the fisher.test function, and have included a subset of the data and output below:
Dataset1: n=817
Dataset2: n=18
Gene   MutationsDataset1  MutationsDataset2  p-value
GeneA  282                1                  0.00975201620794552
GeneB  280                5                  0.626542416245188
GeneC  62                 4                  0.04683126626377
GeneD  50                 3                  0.100176241063714
GeneE  47                 1                  1
GeneF  42                 1                  0.617780181704477
GeneG  41                 1                  0.608902818182774
GeneH  41                 1                  0.0384567660866955
GeneI  21                 6                  9.12505956956652e-06
My question is: why do I get p = 1 for GeneE? Shouldn't a p-value never actually reach 1 or 0, only approach them? Is this just R rounding up from 0.99999…?
This can be replicated as follows:
# Columns are the two datasets; rows are mutated / not mutated counts
df <- data.frame(x = c(47, 817 - 47), y = c(1, 18 - 1))
fisher.test(df, alternative = "two.sided")
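Equivalently (a sketch; the name geneE and the dimnames are mine, added for readability), the same table can be passed as a labelled matrix, which makes the orientation of rows and columns explicit:

```r
# 2x2 contingency table for GeneE, rows = mutation status, columns = dataset
geneE <- matrix(c(47, 770, 1, 17), nrow = 2,
                dimnames = list(c("Mutated", "NotMutated"),
                                c("Dataset1", "Dataset2")))
fisher.test(geneE)  # alternative = "two.sided" is the default
```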
The table for GeneE is as follows:
            Dataset1  Dataset2
Mutated           47         1
NotMutated       770        17
Best Answer
In any randomization test, the p-value is the proportion of possible outcomes (given the data, but not given the assignment to conditions) that are as extreme as or more extreme than the observed data. If the observed outcome is the least extreme one possible, every outcome qualifies and p = 1 exactly. It is a proportion rather than a probability in the limiting sense, so it does not merely converge on 1 — it can reach it.
The ratios of dataset 1 to dataset 2 counts are 47:1 in the mutated row and 770:17 ≈ 45.3:1 in the not-mutated row. That is as close to equal as they can possibly be given the column totals (817 and 18) and the row totals (48 and 787), so no achievable table is less extreme than the one observed.
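To see this numerically: with all margins fixed, R's two-sided fisher.test sums the hypergeometric probabilities of every table whose probability is no greater than the observed table's. For the GeneE margins the observed table is the single most probable one, so the sum covers every table and equals 1. A minimal sketch (dhyper parameterised as m mutated, n not mutated, k drawn into dataset 2):

```r
# Fixed margins for GeneE: 48 mutated, 787 not mutated; 18 samples in dataset 2.
# x = number of mutated samples falling in dataset 2, ranging over 0..18.
probs <- dhyper(0:18, m = 48, n = 787, k = 18)
obs   <- dhyper(1, m = 48, n = 787, k = 18)  # probability of the observed table
all(probs <= obs)                            # TRUE: the observed table is the mode
sum(probs[probs <= obs])                     # 1 (up to floating point)
```

Because the expected count of mutated samples in dataset 2 is 18 * 48 / 835 ≈ 1.04, the observed count of 1 sits at the mode of the hypergeometric distribution, and the two-sided sum includes the whole support.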