I'm studying the prevalence of a parasite on a host species from different localities. For each host sampled, I record presence ("1") or absence ("0") of the parasite.
After the sampling, I got something like this:
site <- c("A", "A", "B", "B", "B", "C", "C","A", "A", "B", "B", "B", "C", "C",
"A", "A", "B", "B", "B", "C", "C","A", "A", "B", "B", "B", "C", "C")
infection <- c("1", "1", "0", "0", "0", "1", "0", "1", "1", "0", "0", "0", "1", "1",
"1", "1", "0", "0", "1", "1", "1", "1", "0", "1", "0", "0", "0", "1")
table1 <- data.frame(site, infection)
I put the data in a data.frame and summarize it by site to automate the process:
library(plyr) # ddply() is in plyr, not dplyr
table.by.site <- ddply(table1,
"site",
summarise,
Infected = length(which(infection=="1")),
NonInfected = length(which(infection=="0")))
table.by.site
In total, 3 sites and 2 states of infection. Number of hosts from each site: A = 180; B = 160; C = 160.
How can I see whether the prevalence of infection depends on the sampling site? I ran a fisher.test and found significant differences:
table.simple <- table.by.site[, -1] # remove the "site" column
fisher.test(table.simple)
But how do I perform a pairwise t-test? How can I introduce my grouping factor (site)? Should I use a different approach?
EDIT
I will edit the question instead of opening a new one.
Let's say I've been collecting samples from 3 different sites (i.e., A, B, and C) at different moments (i.e., week of the year, from 1 to 52). From each of these sampling trips I obtained a certain number of hosts (e.g., 20 from each), examined them, and calculated the prevalence (%) of infected hosts per site and trip.
So, the code to generate the table would be something like this:
week <- c("1","1","2","2","3","3","4","5","5","6","6","6","6","7","8","9",
"10","11","12","13","14","14","14","15","15","15","16","16","16",
"16","16","17","17","18","18","18","18","18")
site <- c("A","A","C","C","B","B","C","B","B","A","A","A","A","B","C","B","C","B",
"C","A","C","B","B","B","A","A","A","A","C","C","C","C","C","A","A","B",
"B","B")
infection <- c(1,0,0,0,1,1,1,0,1,0,1,0,1,0,1,0,1,0,0,0,1,0,1,0,1,0,1,1,1,0,0,0,0,
1,1,0,0,0)
table(week)
raw.table <- data.frame(week, site, infection)
Then I calculate the prevalence of infection for each week and site:
library(plyr)
table.summary <- ddply(raw.table,
.(week, site),
summarize,
Prevalence = sum(infection) * 100 / length(infection))
table.summary
Now, can I use an ANOVA to test for differences in prevalence among sites? Is it a problem that the response values are percentages?
anova.table <- aov(Prevalence ~ site, data=table.summary)
summary(anova.table)
In this example there are no significant differences among sites, but for a post hoc test I would use pairwise comparisons with Bonferroni-corrected p-values:
pairwise.t.test(table.summary$Prevalence, table.summary$site, p.adjust.method="bonferroni")
Is the whole process correct? Should I take additional steps? Are the percentages a problem for this analysis?
Best Answer
I would not use Fisher's exact test for your situation. Fisher's exact test assumes the marginals were fixed in advance (see here), which I doubt is true in your case.
I wonder if you are worried about the validity of the chi-squared test. The rule of thumb for the chi-squared test is that the expected values for the cells should be at least 5, not the observed cell counts, and even that rule is known to be too conservative. In your case, four of the six expected values are less than 5, but not by much, so the chi-squared test is probably fine. Nonetheless, since that is a large proportion of your cells and you have relatively few data, you could set simulate.p.value=TRUE just to be safe.
Another way is to fit a logistic regression model. You can determine whether the sites differ by using a likelihood ratio test of the model as a whole (see here); with these data, that test shows the sites differ.
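Both suggestions can be sketched with the question's data. This is not the answerer's original code; the 3x2 count table below is rebuilt from the toy site/infection vectors in the question (A = 7/1, B = 2/10, C = 6/2 infected/non-infected):

```r
# Counts rebuilt from the question's toy data
counts <- matrix(c(7, 1,
                   2, 10,
                   6, 2),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(site = c("A", "B", "C"),
                                 infection = c("Infected", "NonInfected")))

# Chi-squared test with a simulated p-value, since several expected
# cell counts fall below 5
chisq.test(counts, simulate.p.value = TRUE, B = 10000)

# Logistic regression on the raw 0/1 responses, then a likelihood
# ratio test of the site factor as a whole
site <- rep(c("A", "A", "B", "B", "B", "C", "C"), 4)
infection <- c(1,1,0,0,0,1,0, 1,1,0,0,0,1,1, 1,1,0,0,1,1,1, 1,0,1,0,0,0,1)
full.model <- glm(infection ~ site, family = binomial)
null.model <- glm(infection ~ 1,   family = binomial)
anova(null.model, full.model, test = "LRT")
```

The likelihood ratio test compares the deviances of the null and full models against a chi-squared distribution with 2 degrees of freedom.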
To run multiple comparisons of the different sites against each other, you may want to apply the Bonferroni correction to the p-values. This is a bit conservative, but it may be reasonable since your contrasts are neither a priori nor orthogonal. To control the familywise type I error rate at .05, you want the individual p-values to be less than .0167 (.05 / 3). With relatively few data, you may not want to trust the normal approximation for the Wald tests in the output above. Instead, to check for pairwise differences between the sites, you can select only two sites at a time and use the likelihood ratio test as before. From this, you can see that B differs from A and from C, but that C does not differ from A.