Solved – Comparing prevalence among sites. Fisher.test and pairwise.t.test

fishers-exact-testmultiple-comparisonspercentager

I'm studying the prevalence of one parasite on a host from different localities. I do it by assigning presence ("1") or absence ("0") to each host sampled.

After the sampling, I got something like this:

site      <- c("A", "A", "B", "B", "B", "C", "C","A", "A", "B", "B", "B", "C", "C", 
               "A", "A", "B", "B", "B", "C", "C","A", "A", "B", "B", "B", "C", "C")
infection <- c("1", "1", "0", "0", "0", "1", "0", "1", "1", "0", "0", "0", "1", "1", 
               "1", "1", "0", "0", "1", "1", "1", "1", "0", "1", "0", "0", "0", "1")
table1    <- data.frame (site, infection)

I created a data.frame with the data to automatize the process:

library(dplyr)
table.by.site <- ddply(table1, 
                       "site", 
                       summarise,
                       Infected = length(which(infection=="1")),
                       NonInfected = length(which(infection=="0")))
table.by.site

In total, 3 sites and 2 states of infection. Number of hosts from each site: A= 180; B= 160; C=160.

How can I see if the prevalence of the infections depends on the site of sampling? I have done a fisher.test, finding significant differences:

table.simple <- table.by.site [,-1] #I remove the "site" column.
fisher.test(table.simple)

But how I perform a pairwise t-test? How can I introduce my grouping factor (site)? Should I use a different approach?

EDIT

I will edit the questions instead open a new one.

Let's say I've been collecting samples from 3 different sites (i.e., A, B, and C) in different moments (i.e., week of the year, from 1 to 52). From each of this sampling trips I could obtain a certain number of hosts (e.g. 20 from each). I examined these hosts and I calculate the prevalence (%) of infected hosts per site and trip.

So, the code to generate the table would be something like this:

week      <- c("1","1","2","2","3","3","4","5","5","6","6","6","6","7","8","9",
               "10","11","12","13","14","14","14","15","15","15","16","16","16",
               "16","16","17","17","18","18","18","18","18")
site      <- c("A","A","C","C","B","B","C","B","B","A","A","A","A","B","C","B","C","B",
               "C","A","C","B","B","B","A","A","A","A","C","C","C","C","C","A","A","B",
               "B","B")
infection <- c(1,0,0,0,1,1,1,0,1,0,1,0,1,0,1,0,1,0,0,0,1,0,1,0,1,0,1,1,1,0,0,0,0,
               1,1,0,0,0)
table (week)
raw.table <- data.frame (week, site, infection)

Then I calculate the Prevalence of infection for each site and site:

library(plyr)
table.summary <- ddply(raw.table,
                       .(week, site),
                       summarize,
                       Prevalence = ( (sum (infection)*100) / length(infection)))
table.summary

Now, can I use an ANOVA to test the differences in prevalence among sites? Is there any problem because they are percentages?

anova.table <- aov(Prevalence ~ site, data=table.summary)
summary(anova.table)

In the example there is no significant differences among site, but if I want to do a post hoc test, I'd use a pairwise comparison, correcting the p-value with Bonferroni

pairwise.t.test(table.summary$Prevalence, table.summary$site, adj.meth="bonferroni")

Is the whole process correct? Should I take additional steps? Are the percentages a problem for this analysis?

Best Answer

I would not use Fisher's exact test for your situation. Fisher's exact test assumes the marginals were fixed in advance (see here), which I doubt is true in your case.

I wonder if you are worried about the validity of the chi-squared test. The rule of thumb for the chi-squared test is that the expected values for the cells should be at least 5, not that the observed cell counts need to be. That rule of thumb is known to be too conservative. In your case, four out of six expected values are less than 5, but not by much, so the chi-squared test is probably fine. Nonetheless, since that is a large proportion of your cells, and you have relatively few data, you could set simulate.p.value=TRUE just to be safe:

table(site, infection)
#     infection
# site  0  1
#    A  1  7
#    B 10  2
#    C  2  6
chisq.test(table(site, infection))$expected
#     infection
# site        0        1
#    A 3.714286 4.285714
#    B 5.571429 6.428571
#    C 3.714286 4.285714
chisq.test(table(site, infection), simulate.p.value=TRUE)
#  Pearson's Chi-squared test with simulated p-value
#  (based on 2000 replicates)
# 
# data:  table(site, infection)
# X-squared = 11.75, df = NA, p-value = 0.003498

Another way is to fit a logistic regression model. You can determine if the sites differ by using a likelihood ratio test of the model as a whole (see here). Here we see that the sites differ:

infection = factor(infection, levels=c("0","1"))
mod       = glm(infection~site, family=binomial)
summary(mod)
# ...
# Coefficients:
#             Estimate Std. Error z value Pr(>|z|)   
# (Intercept)   1.9459     1.0690   1.820  0.06872 . 
# siteB        -3.5553     1.3202  -2.693  0.00708 **
# siteC        -0.8473     1.3452  -0.630  0.52878   
# ...
#     Null deviance: 38.673  on 27  degrees of freedom
# Residual deviance: 25.839  on 25  degrees of freedom
# ...
1-pchisq(38.673-25.839, df=2)  # [1] 0.00163355

To run multiple comparisons on the different sites versus each other, we may want to use the Bonferroni correction to the p-value. This is a bit conservative, but may be reasonable since your contrasts are not a-priori or orthogonal. To control the familywise type I error rate at .05, you want the individual p-values to be less than .0167. With relatively few data, you may not want to trust the normal approximation for the Wald tests in the output above. To check for pairwise differences between the sites, you can select only two sites at a time and use the likelihood ratio test as before. From this, you can see that B differs from A, and from C, but that C does not differ from A:

.05/3  # [1] 0.01666667
summary(glm(infection~site, family=binomial, subset=site!="C"))  # B vs A
# ...
#     Null deviance: 27.526  on 19  degrees of freedom
# Residual deviance: 16.842  on 18  degrees of freedom
# ...
1-pchisq(27.526-16.842, df=1)  # [1] 0.001080661
summary(glm(infection~site, family=binomial, subset=site!="A"))  # B vs C
# ...
#     Null deviance: 26.920  on 19  degrees of freedom
# Residual deviance: 19.811  on 18  degrees of freedom
# ...
1-pchisq(26.920-19.811, df=1)  # [1] 0.007669788
summary(glm(infection~site, family=binomial, subset=site!="B"))  # C vs A
# ...
#     Null deviance: 15.442  on 15  degrees of freedom
# Residual deviance: 15.026  on 14  degrees of freedom
# ...
1-pchisq(15.442-15.026, df=1)  # [1] 0.5189397