Solved – Nested ANOVA: Unequal sample sizes? Variance components

anovanested datapartitioningr

I am completely out of my depth on this, and all the reading I try to do just confuses me. I'm hoping you can explain things to me in a way that makes sense. (As always seems to be the case, "It shouldn't be this hard!")

I'm trying to help a student who is looking at the effect of social systems on prevalence of diseases in various canid host species. We want to consider social system (e.g., group-living vs. solitary) as a fixed effect, and host species as a random effect nested within social system (i.e., each species only ever has one social system type).

My understanding is that the best way to do this would be to do a mixed-effects logistic regression. We've done this, and it works, and we were happy. Unfortunately, her advisor is insisting that she calculate the amount of variation due to social system vs. host species vs. residual. I can't figure out how to do this via mixed-effects logistic regression, and my previous question on this topic went unanswered.

Her advisor suggested doing ANOVA instead, logit-transforming disease prevalence values (the fraction of each population that is infected). This presented a problem because some of the prevalence values are 0 or 1, which would result in $-\infty$ or $\infty$ once logit-transformed. Her advisor's "solution" was to just substitute $-5$ and $5$ for $-\infty$ or $\infty$, respectively. This feels really kludgey and makes me cringe pretty hard. But he's the one grading her, and at this point I just want to be done with this, so if he's fine with it then whatever.

We are using R for this analysis. The code can be downloaded here, and the input data here. The data file includes data on two different pathogens (A and B), which we are analyzing separately (as shown in the code).

Here's the ANOVA setup we made for Pathogen B:

mod1.lm <- lm(Seroprevalence_logit ~ Social.System + Social.System/Host.Species,
              data = prev_B)
print(mod1.anova <- anova(mod1.lm))

This leads to my first question: Is this correct and appropriate? Factors to consider:

  • We want to have a Model II (random effect) variable nested within a Model I (fixed effect) variable.
  • Not every social system has the same number of host species nested within it.
  • Not every host species has the same number of populations examined.
  • Not every population examined had the same number of individuals (column N_indiv in mydata.csv). This is more of a weighting problem than something more fundamental, I think.

My next question, and the main one of this post, is: How do I partition the variance? Here's what we were thinking:

MS_A <- mod1.anova$"Mean Sq"[1]
MS_BinA <- mod1.anova$"Mean Sq"[2]
MS_resid <- mod1.anova$"Mean Sq"[3]
n <- length(unique(prev_A$Social.System))
r <- length(unique(prev_A$Host.Species))
VC_A <- (MS_A - MS_BinA)/(n*r)
VC_BinA <- (MS_BinA - MS_resid)/n
VC_resid <- MS_resid

Unfortunately, this results in sadness using the ANOVA specification I detailed above. Here are the results for Pathogen B:

  • VC_A (i.e., Social.System): $-1.48$
  • VC_BinA (i.e., Host.Species): $13.8$
  • VC_resid: $5.57$

Research leads me to believe that this should result in variance component percentages of 0%, 71.3%, and 28.7%, respectively. However, this is unsatisfying for two reasons:

  • The p-value for Social.System from the ANOVA was ~$0.025$, suggesting that it should account for at least some of the observed variance. (Host.Species had a p-value of ~$3*10^{-5}$.)
  • I'm concerned that a negative variance component might be a red flag for something.

Please, any assistance you can render on either of these questions would be greatly appreciated. I TA'd an undergraduate course on biostatistics, so I've got some background, but I just can't seem to figure out these specific issues. Thanks in advance.

Best Answer

Hopefully your friend has graduated by now, but if not, the following might help.

You were on the right track in your original post Partitioning variance from logistic regression, using glmer() for mixed-effects logistic regression.

I would recommend against: the advisor's "solution", using lm() for logistic regression, and weighting rows equally (you should weight by N_indiv).

Generalized linear mixed models are tough. http://glmm.wikidot.com/faq has some good information - including the fact that you need many levels of a random factor in order to estimate its variance component.

My code below requires the lme4 package and the data from your link.

# Seroprevalance has been rounded, that's not OK
# to do logistic regression, (proportion * weight) must equal an integer
prev$seroexact <- round(prev$Seroprevalence * prev$N_indiv)/prev$N_indiv

# Host.Species is nested within Social.system, but you didn't reuse 
# species letters between Social.systems, so you can specify 
# Host.Species as a random effect without explicitly nesting it

# First random effect model
prev1.glmer = glmer(seroexact ~ Pathogen + Social.System + (1|Host.Species),
                  family=binomial(link="logit"), weights=N_indiv, data = prev)
summary(prev1.glmer)

## Fixed effects:
# Intercept is pathogen A and social.system A.  
# The z-test of the intercept is testing if the logit=0
# I.e. it's testing whether the combination of
# pathogen A and social.system A has prob=0.5.
# The other z-tests are testing whether other levels of the factors
# yield different probabilities than pathogen A and social.system A

## Random effects:
# This doesn't give you separate Host.Species and residual variances,
# Host.Species is treated as a random effect, so this model is the same as if
# you had summed the results of all studies with identical values of
# Host.Species, Pathogen, and Social.System. I.e. sum the results of the
# first 8 rows and create a single proportion and N_indiv, like so:

prevsum<-aggregate(cbind(N_indiv, prop=(seroexact*N_indiv)) ~ 
                   Social.System+Host.Species+Pathogen, data=prev, sum)
prevsum$prop<-prevsum$prop/prevsum$N_indiv

# which gives the same model:
prevsum.glmer = glmer(prop ~ Pathogen + Social.System + (1|Host.Species),
                      family=binomial(link="logit"), weights=N_indiv, data = prevsum)
summary(prevsum.glmer)

# So why are they broken up into multiple rows?  If each row represents
# one geographic area/time/litter/study/etc. then animals in one row
# might be more similar to eachother than they are to animals in
# another row that has the same values of Social, Species, & Pathogen.
# I think this is what the advisor wants as a "residual".

# To allow a random component for each row:
prev2<-cbind(resid=paste("Row_", row.names(prev), sep=""), prev)

prev2.glmer = glmer(seroexact ~ Pathogen + Social.System + (1|Host.Species) + (1|resid),
                   family=binomial(link="logit"), weights=N_indiv, data = prev2)
summary(prev2.glmer)

# This isn't a bad start, but I'm not comfortable with it because:
table(prev2[,2:3])

# Social.Sytstem D is only observed in Species F.
# This is called confounding, and it makes it hard to draw conclusions
# about Social Sytstem D.  How do you know what is caused by social
# system D and what is caused by species F?  If your friend really wants to
# make inferences about Social System D, she should collect data from
# another host species that uses Social System D.

# Leave out Soc_D:
prev3.glmer = glmer(seroexact ~ Pathogen + Social.System + (1|Host.Species) + (1|resid),
                    family=binomial(link="logit"), weights=N_indiv, 
                    data = prev2[prev2$Social.System != "Soc_D",])
summary(prev3.glmer)

# Even though Host Species is conceptually a random factor, you really need to observe
# more than 2 species per social system for a mixed model to accurately estimate
# the species variance.  As far as species variance is concerned, each species is a
# single sample (not animals or even litters), and you can't hope to estimate variance
# accurately with only two samples.

# We can fit the model with species as a fixed effect, but we don't have
# enough degrees of freedom to estimate all levels of Species:
prev4.glmer = glmer(seroexact ~ Pathogen + Social.System + Host.Species + (1|resid),
                    family=binomial(link="logit"), weights=N_indiv, 
                    data = prev2[prev2$Social.System != "Soc_D",])

# Your friend doesn't need to estimate the level of each species in order to test
# whether species has any noticeable effect at all.  Unfortunately, we can't just
# Use the F statistic from anova() because calculating the denominator df for a
# GLMM is not straightforward.
anova(prev4.glmer) #Gives you an F statistic, but no denominator df or p-value

# Instead we fit a simpler model without Species:
prev5.glmer = glmer(seroexact ~ Pathogen + Social.System + (1|resid),
                    family=binomial(link="logit"), weights=N_indiv, 
                    data = prev2[prev2$Social.System != "Soc_D",])

# And we'll compare the two models With a Likelihood-Ratio test using anova()
anova(prev5.glmer,prev4.glmer)

# With a p-value of 0.01331 we can say it's worth keeping Species in the model.

# Now let's check the pathogen * social system interaction:
prev6.glmer = glmer(seroexact ~ Pathogen * Social.System + Host.Species + (1|resid),
                    family=binomial(link="logit"), weights=N_indiv, nAGQ=2, 
                    data = prev2[prev2$Social.System != "Soc_D",])
summary(prev6.glmer) #Neither interaction term is significant
anova(prev6.glmer)
# We don't need a denominator df to know that the F statistic of 0.0774 for
# the interaction is insignificant.

# Since the interaction between Pathogen and Social System was not significant,
# we don't need to include the interaction term.  Similarly, I don't see a 
# statistical reason to  split the model into two separate 'pathogen specific'
# models, but maybe there's a scientific reason to do so:

# Separate tests for each pathogen:
prev7A.glmer = glmer(seroexact ~ Social.System + Host.Species + (1|resid),
                    family=binomial(link="logit"), weights=N_indiv, 
                    data = prev2[prev2$Social.System != "Soc_D" & prev$Pathogen == "Path_A",])
summary(prev7A.glmer)
# Social System B looks different from Social System A in pathogen A prevalance:

# Calculate the odds of having Pathogen A for Social System A vs B
beta7A<-fixef(prev7A.glmer)
exp(-beta7A[2]) #negative sign means odds of A:B instead of B:A
# So animals with Social System A have about 25 times the odds of
# animals with social system B of having Pathogen A

# Test for Pathogen B:
prev7B.glmer = glmer(seroexact ~ Social.System + Host.Species + (1|resid),
                    family=binomial(link="logit"), weights=N_indiv, 
                    data = prev2[prev2$Social.System != "Soc_D" & prev$Pathogen == "Path_B",])
summary(prev7B.glmer)
# The only significant effects are species-specific, which are not of interest

# Let's return to prev4.glmer, which models both pathogens:
summary(prev4.glmer)

# The only significant fixed effect in prev4.glmer is Pathogen.
beta4<-fixef(prev4.glmer)

# For a randomly selected animal, the odds of having Pathogen B to having Pathogen A are:
exp(beta4[2])

# That's about as much as you can interpret with the data she has.

# To answer the Advisor's request for variance components:
# Residual variance is:
getME(prev4.glmer, "theta")^2

# You can't do a good job of estimating species variance with these data.
# If her advisor won't listen, then you can tell him that your estimate is:
getME(prev3.glmer, "theta")[2]^2
# But it's a really crappy estimate.

# There is no such thing as a variance component for Social System because
# it's a fixed effect.  But you can get its sum of squares:
anova(prev4.glmer)
Related Question