I am completely out of my depth on this, and all the reading I try to do just confuses me. I'm hoping you can explain things to me in a way that makes sense. (As always seems to be the case, "It shouldn't be this hard!")
I'm trying to help a student who is looking at the effect of social systems on prevalence of diseases in various canid host species. We want to consider social system (e.g., group-living vs. solitary) as a fixed effect, and host species as a random effect nested within social system (i.e., each species only ever has one social system type).
My understanding is that the best way to do this would be to do a mixed-effects logistic regression. We've done this, and it works, and we were happy. Unfortunately, her advisor is insisting that she calculate the amount of variation due to social system vs. host species vs. residual. I can't figure out how to do this via mixed-effects logistic regression, and my previous question on this topic went unanswered.
Her advisor suggested doing ANOVA instead, logit-transforming disease prevalence values (the fraction of each population that is infected). This presented a problem because some of the prevalence values are 0 or 1, which would result in $-\infty$ or $\infty$ once logit-transformed. Her advisor's "solution" was to just substitute $-5$ and $5$ for $-\infty$ or $\infty$, respectively. This feels really kludgey and makes me cringe pretty hard. But he's the one grading her, and at this point I just want to be done with this, so if he's fine with it then whatever.
We are using R for this analysis. The code can be downloaded here, and the input data here. The data file includes data on two different pathogens (A and B), which we are analyzing separately (as shown in the code).
Here's the ANOVA setup we made for Pathogen B:
mod1.lm <- lm(Seroprevalence_logit ~ Social.System + Social.System/Host.Species,
data = prev_B)
print(mod1.anova <- anova(mod1.lm))
This leads to my first question: Is this correct and appropriate? Factors to consider:
- We want to have a Model II (random effect) variable nested within a Model I (fixed effect) variable.
- Not every social system has the same number of host species nested within it.
- Not every host species has the same number of populations examined.
- Not every population examined had the same number of individuals (column N_indiv in mydata.csv). This is more of a weighting problem than something more fundamental, I think.
My next question, and the main one of this post, is: How do I partition the variance? Here's what we were thinking:
MS_A <- mod1.anova$"Mean Sq"[1]
MS_BinA <- mod1.anova$"Mean Sq"[2]
MS_resid <- mod1.anova$"Mean Sq"[3]
n <- length(unique(prev_A$Social.System))
r <- length(unique(prev_A$Host.Species))
VC_A <- (MS_A - MS_BinA)/(n*r)
VC_BinA <- (MS_BinA - MS_resid)/n
VC_resid <- MS_resid
Unfortunately, this results in sadness using the ANOVA specification I detailed above. Here are the results for Pathogen B:
VC_A
(i.e., Social.System): $-1.48$VC_BinA
(i.e., Host.Species): $13.8$VC_resid
: $5.57$
Research leads me to believe that this should result in variance component percentages of 0%, 71.3%, and 28.7%, respectively. However, this is unsatisfying for two reasons:
- The p-value for Social.System from the ANOVA was ~$0.025$, suggesting that it should account for at least some of the observed variance. (Host.Species had a p-value of ~$3*10^{-5}$.)
- I'm concerned that a negative variance component might be a red flag for something.
Please, any assistance you can render on either of these questions would be greatly appreciated. I TA'd an undergraduate course on biostatistics, so I've got some background, but I just can't seem to figure out these specific issues. Thanks in advance.
Best Answer
Hopefully your friend has graduated by now, but if not, the following might help.
You were on the right track in your original post Partitioning variance from logistic regression, using
glmer()
for mixed-effects logistic regression.I would recommend against: the advisor's "solution", using lm() for logistic regression, and weighting rows equally (you should weight by N_indiv).
Generalized linear mixed models are tough. http://glmm.wikidot.com/faq has some good information - including the fact that you need many levels of a random factor in order to estimate its variance component.
My code below requires the lme4 package and the data from your link.