I would not use Fisher's exact test for your situation. Fisher's exact test assumes the marginals were fixed in advance (see here), which I doubt is true in your case.
I wonder if you are worried about the validity of the chi-squared test. The rule of thumb for the chi-squared test is that the expected values for the cells should be at least 5, not that the observed cell counts need to be. That rule of thumb is known to be too conservative. In your case, four out of six expected values are less than 5, but not by much, so the chi-squared test is probably fine. Nonetheless, since that is a large proportion of your cells, and you have relatively few data, you could set simulate.p.value=TRUE just to be safe:
table(site, infection)
# infection
# site 0 1
# A 1 7
# B 10 2
# C 2 6
chisq.test(table(site, infection))$expected
# infection
# site 0 1
# A 3.714286 4.285714
# B 5.571429 6.428571
# C 3.714286 4.285714
chisq.test(table(site, infection), simulate.p.value=TRUE)
# Pearson's Chi-squared test with simulated p-value
# (based on 2000 replicates)
#
# data: table(site, infection)
# X-squared = 11.75, df = NA, p-value = 0.003498
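For readers who want to reproduce this, the raw vectors can be reconstructed from the contingency table above (a sketch; the counts are inferred from the table, and the variable names simply match the calls used in this answer):

```r
# Rebuild the data from the 3x2 table shown above (counts inferred from
# the table, not the asker's original vectors)
site <- factor(rep(c("A", "B", "C"), times = c(8, 12, 8)))
infection <- factor(c(rep(0:1, c(1, 7)),    # site A: 1 uninfected, 7 infected
                      rep(0:1, c(10, 2)),   # site B: 10 uninfected, 2 infected
                      rep(0:1, c(2, 6))))   # site C: 2 uninfected, 6 infected
set.seed(1)  # the simulated p-value varies slightly from run to run
chisq.test(table(site, infection), simulate.p.value = TRUE)
```

Note that because the p-value is estimated by Monte Carlo simulation, it will differ a little between runs (and from the 0.003498 shown above) unless you set the seed.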
Another way is to fit a logistic regression model. You can determine if the sites differ by using a likelihood ratio test of the model as a whole (see here). Here we see that the sites differ:
infection = factor(infection, levels=c("0","1"))
mod = glm(infection~site, family=binomial)
summary(mod)
# ...
# Coefficients:
# Estimate Std. Error z value Pr(>|z|)
# (Intercept) 1.9459 1.0690 1.820 0.06872 .
# siteB -3.5553 1.3202 -2.693 0.00708 **
# siteC -0.8473 1.3452 -0.630 0.52878
# ...
# Null deviance: 38.673 on 27 degrees of freedom
# Residual deviance: 25.839 on 25 degrees of freedom
# ...
1-pchisq(38.673-25.839, df=2) # [1] 0.00163355
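The same likelihood ratio test can be obtained without doing the deviance subtraction by hand: `anova()` on the fitted model computes the drop in deviance for `site` directly (a sketch, reconstructing the data from the table above):

```r
# Data reconstructed from the 3x2 table earlier in this answer
site <- factor(rep(c("A", "B", "C"), times = c(8, 12, 8)))
infection <- factor(c(rep(0:1, c(1, 7)), rep(0:1, c(10, 2)), rep(0:1, c(2, 6))))

mod <- glm(infection ~ site, family = binomial)
# Likelihood ratio test of the model as a whole:
# deviance drop of 38.673 - 25.839 = 12.834 on 2 df
anova(mod, test = "Chisq")
```

This reproduces the p-value of about .0016 computed above with pchisq().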
To run multiple comparisons on the different sites versus each other, we may want to use the Bonferroni correction to the p-value. This is a bit conservative, but may be reasonable since your contrasts are not a-priori or orthogonal. To control the familywise type I error rate at .05, you want the individual p-values to be less than .0167. With relatively few data, you may not want to trust the normal approximation for the Wald tests in the output above. To check for pairwise differences between the sites, you can select only two sites at a time and use the likelihood ratio test as before. From this, you can see that B differs from A, and from C, but that C does not differ from A:
.05/3 # [1] 0.01666667
summary(glm(infection~site, family=binomial, subset=site!="C")) # B vs A
# ...
# Null deviance: 27.526 on 19 degrees of freedom
# Residual deviance: 16.842 on 18 degrees of freedom
# ...
1-pchisq(27.526-16.842, df=1) # [1] 0.001080661
summary(glm(infection~site, family=binomial, subset=site!="A")) # B vs C
# ...
# Null deviance: 26.920 on 19 degrees of freedom
# Residual deviance: 19.811 on 18 degrees of freedom
# ...
1-pchisq(26.920-19.811, df=1) # [1] 0.007669788
summary(glm(infection~site, family=binomial, subset=site!="B")) # C vs A
# ...
# Null deviance: 15.442 on 15 degrees of freedom
# Residual deviance: 15.026 on 14 degrees of freedom
# ...
1-pchisq(15.442-15.026, df=1) # [1] 0.5189397
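The three pairwise likelihood ratio tests above can also be run in a short loop, with p.adjust() applying the Bonferroni correction for you; comparing the adjusted p-values to .05 is equivalent to comparing the raw ones to .05/3 (a sketch, reconstructing the data from the table above):

```r
# Data reconstructed from the 3x2 table earlier in this answer
site <- factor(rep(c("A", "B", "C"), times = c(8, 12, 8)))
infection <- factor(c(rep(0:1, c(1, 7)), rep(0:1, c(10, 2)), rep(0:1, c(2, 6))))

# Likelihood ratio test for each pair of sites
pairs <- combn(levels(site), 2, simplify = FALSE)
pvals <- sapply(pairs, function(pr) {
  keep <- site %in% pr
  m <- glm(infection[keep] ~ droplevels(site[keep]), family = binomial)
  pchisq(m$null.deviance - m$deviance, df = 1, lower.tail = FALSE)
})
names(pvals) <- sapply(pairs, paste, collapse = " vs ")

# Bonferroni-adjusted p-values: B differs from A and from C; A does not
# differ from C
p.adjust(pvals, method = "bonferroni")
```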
Best Answer
I have now addressed this problem (and similar ones) in fisher.test(), and indeed fixed the bug (in "R-devel" only; see the bugs.r-project.org link above). Increasing 'workspace' now does help in fisher.test(), but this example takes a lot of time if you run the exact test (the default, hybrid=FALSE), even on a machine with enough memory (4 GB free is OK).
I have not had the patience to also try the hybrid=TRUE case; it may take less than the 1 h 47 min of the exact case. Statistically, the answers above are of course sufficient, and indeed, in many such cases, using simulate.p.value=TRUE is the more realistic approach here.