Solved – Comparing p-values for Fisher’s exact test and test of equal proportions

fishers-exact-test, proportion, r, self-study

I'm comparing success rates for two repeated experiments with $n_1$ and $n_2$ successes out of $N$ trials each. $N$ is on the order of $10^7$. I don't know beforehand which experiment has the lower success rate, so I'd probably use a two-sided test. I see two options (both sketched in code right after this list):

  1. The test of equal proportions (in R: prop.test(c(n1, n2), c(N, N)))
  2. Fisher's exact test (in R: fisher.test(matrix(c(n1, n2, N-n1, N-n2), ncol = 2)))
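For concreteness, a minimal runnable sketch of both calls (the counts are made up for illustration, and $N$ is kept smaller than in my actual data so it runs quickly):

    # Illustrative counts; a smaller N than in the question so this runs quickly
    N  <- 1e5
    n1 <- 1200
    n2 <- 1300

    # Option 1: test of equal proportions (Pearson chi-square)
    prop.test(c(n1, n2), c(N, N))

    # Option 2: Fisher's exact test on the corresponding 2x2 table
    fisher.test(matrix(c(n1, n2, N - n1, N - n2), ncol = 2))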

Now, for $N$ that large, Fisher's exact test is slow. (The implementation seems to evaluate the hypergeometric density on a support whose size is of the order of $N$.) The test of equal proportions, however, seems to have less power.

Does the test of equal proportions always return a p-value no less than that of Fisher's exact test on the same data? And is there a more powerful alternative to the test of equal proportions that is less expensive computationally?

EDIT: A computational experiment on 1000 matrices with "almost equal" entries suggests that the p-value computed by Fisher's test is almost always, but not always, the larger one. I'm still looking for a stronger argument, and for a faster and/or more powerful test.
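A sketch of that kind of comparison (the generating rate and the smaller $N$ here are illustrative assumptions, not my exact simulation):

    # Compare the two p-values over many "almost equal" tables
    set.seed(1)
    N    <- 1e4                     # kept small so fisher.test stays fast
    reps <- 1000
    fisher_larger <- logical(reps)
    for (i in seq_len(reps)) {
      n1 <- rbinom(1, N, 0.1)       # both counts drawn with the same true rate
      n2 <- rbinom(1, N, 0.1)
      p_prop   <- prop.test(c(n1, n2), c(N, N))$p.value
      p_fisher <- fisher.test(matrix(c(n1, n2, N - n1, N - n2), ncol = 2))$p.value
      fisher_larger[i] <- p_fisher > p_prop
    }
    mean(fisher_larger)             # fraction of cases where Fisher's p-value is larger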

Best Answer

prop.test uses a Pearson chi-square test. This is an asymptotic test; it performs worst when you have small samples or get too near the tails. Fisher's will always be "better" because it is an "exact" test that does not rely upon asymptotic arguments to obtain its p-values. Rather, it computes all the ways the table could have come about and then finds the proportion that were as-or-more-extreme.
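To make that concrete, here is the two-sided p-value computed by hand from the hypergeometric density on a toy table; R's fisher.test does essentially this sum over the support (the relative-tolerance factor mirrors its source), which is also why its cost grows with the size of the support:

    # Fisher's two-sided p-value "by hand" for a small 2x2 table
    n1 <- 7; n2 <- 2; N <- 10
    m  <- n1 + n2                      # total successes (a fixed margin)
    x  <- max(0, m - N):min(N, m)      # support of the first experiment's count
    d  <- dhyper(x, m, 2 * N - m, N)   # probability of each possible table
    # sum the probabilities of all tables as-or-less-likely than the observed one
    sum(d[d <= dhyper(n1, m, 2 * N - m, N) * (1 + 1e-7)])
    fisher.test(matrix(c(n1, n2, N - n1, N - n2), ncol = 2))$p.value  # same value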

Practically, this will make Fisher's look less "powerful" exactly when it matters, because those are the cases where Pearson's approximation is most wrong (it tends to be anti-conservative there).

I do not know why fisher.test should take so long. For sample sizes on the order of $10^7$, I would expect it to fall back to approximate methods unless the events are really rare. Are they? An alternative might be binom.test, which is also an exact test and may scale better when sample sizes get large and event rates are still common. That might speed things up. A Monte Carlo version might work, too.
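One concrete way to use binom.test here (my reading of the suggestion, not spelled out above): condition on the total number of successes $m = n_1 + n_2$. Given $m$, under the null the first experiment's count is exactly hypergeometric, which is approximately $\mathrm{Binomial}(m, 1/2)$ when $m \ll N$, so the following is fast and close to Fisher's test for rare events:

    # Conditional approximation: given m = n1 + n2 total successes, under H0
    # the first count is ~ Binomial(m, 1/2) when m << N (hypergeometric exactly)
    n1 <- 1200; n2 <- 1300; N <- 1e7
    binom.test(n1, n1 + n2, p = 0.5)$p.value
    # compare with the exact conditional (Fisher) test:
    fisher.test(matrix(c(n1, n2, N - n1, N - n2), ncol = 2))$p.value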

In your case, with sample sizes this high and non-rare events, Fisher's and Pearson's should not disagree to any real extent, but I'd request the continuity correction on Pearson's: prop.test(..., correct = TRUE). Try your simulation with this option and see if there is a dime's worth of difference then.
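Note that correct = TRUE is in fact the default in R's prop.test, so the call in the question already used it; a quick check of its effect (toy numbers, purely illustrative):

    # Effect of the continuity correction on the Pearson test
    N <- 1e5; n1 <- 120; n2 <- 150
    prop.test(c(n1, n2), c(N, N), correct = TRUE)$p.value   # with correction (default)
    prop.test(c(n1, n2), c(N, N), correct = FALSE)$p.value  # without
    fisher.test(matrix(c(n1, n2, N - n1, N - n2), ncol = 2))$p.value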

Another option is Barnard's unconditional test, which can be more powerful but which many people frown at (even Barnard did), though their cited reasons are often esoteric. In any case, it is not likely to be faster than either Pearson's or Fisher's.
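If you want to try it regardless, one option (my suggestion; the package is not mentioned above) is the CRAN package Exact, which implements unconditional tests such as Boschloo's variant; it is only practical for small samples:

    # Unconditional exact test via the 'Exact' package (feasible only for small N)
    # install.packages("Exact")
    library(Exact)
    exact.test(matrix(c(7, 2, 3, 8), ncol = 2), method = "boschloo")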