Solved – Nested ANOVA: Unequal sample sizes? Variance components

anovanested datapartitioningr

I am completely out of my depth on this, and all the reading I try to do just confuses me. I'm hoping you can explain things to me in a way that makes sense. (As always seems to be the case, "It shouldn't be this hard!")

I'm trying to help a student who is looking at the effect of social systems on prevalence of diseases in various canid host species. We want to consider social system (e.g., group-living vs. solitary) as a fixed effect, and host species as a random effect nested within social system (i.e., each species only ever has one social system type).

My understanding is that the best way to do this would be to do a mixed-effects logistic regression. We've done this, and it works, and we were happy. Unfortunately, her advisor is insisting that she calculate the amount of variation due to social system vs. host species vs. residual. I can't figure out how to do this via mixed-effects logistic regression, and my previous question on this topic went unanswered.

Her advisor suggested doing ANOVA instead, logit-transforming disease prevalence values (the fraction of each population that is infected). This presented a problem because some of the prevalence values are 0 or 1, which would result in $-\infty$ or $\infty$ once logit-transformed. Her advisor's "solution" was to just substitute $-5$ and $5$ for $-\infty$ or $\infty$, respectively. This feels really kludgey and makes me cringe pretty hard. But he's the one grading her, and at this point I just want to be done with this, so if he's fine with it then whatever.

We are using R for this analysis. The code can be downloaded here, and the input data here. The data file includes data on two different pathogens (A and B), which we are analyzing separately (as shown in the code).

Here's the ANOVA setup we made for Pathogen B:

mod1.lm <- lm(Seroprevalence_logit ~ Social.System + Social.System/Host.Species,
              data = prev_B)
print(mod1.anova <- anova(mod1.lm))

This leads to my first question: Is this correct and appropriate? Factors to consider:

We want to have a Model II (random effect) variable nested within a Model I (fixed effect) variable.
Not every social system has the same number of host species nested within it.
Not every host species has the same number of populations examined.
Not every population examined had the same number of individuals (column N_indiv in mydata.csv). This is more of a weighting problem than something more fundamental, I think.

My next question, and the main one of this post, is: How do I partition the variance? Here's what we were thinking:

MS_A <- mod1.anova$"Mean Sq"[1]
MS_BinA <- mod1.anova$"Mean Sq"[2]
MS_resid <- mod1.anova$"Mean Sq"[3]
n <- length(unique(prev_A$Social.System))
r <- length(unique(prev_A$Host.Species))
VC_A <- (MS_A - MS_BinA)/(n*r)
VC_BinA <- (MS_BinA - MS_resid)/n
VC_resid <- MS_resid

Unfortunately, this results in sadness using the ANOVA specification I detailed above. Here are the results for Pathogen B:

VC_A (i.e., Social.System): $-1.48$
VC_BinA (i.e., Host.Species): $13.8$
VC_resid: $5.57$

Research leads me to believe that this should result in variance component percentages of 0%, 71.3%, and 28.7%, respectively. However, this is unsatisfying for two reasons:

The p-value for Social.System from the ANOVA was ~$0.025$, suggesting that it should account for at least some of the observed variance. (Host.Species had a p-value of ~$3*10^{-5}$.)
I'm concerned that a negative variance component might be a red flag for something.

Please, any assistance you can render on either of these questions would be greatly appreciated. I TA'd an undergraduate course on biostatistics, so I've got some background, but I just can't seem to figure out these specific issues. Thanks in advance.

Best Answer

Hopefully your friend has graduated by now, but if not, the following might help.

You were on the right track in your original post Partitioning variance from logistic regression, using glmer() for mixed-effects logistic regression.

I would recommend against: the advisor's "solution", using lm() for logistic regression, and weighting rows equally (you should weight by N_indiv).

Generalized linear mixed models are tough. http://glmm.wikidot.com/faq has some good information - including the fact that you need many levels of a random factor in order to estimate its variance component.

My code below requires the lme4 package and the data from your link.

# Seroprevalance has been rounded, that's not OK
# to do logistic regression, (proportion * weight) must equal an integer
prev$seroexact <- round(prev$Seroprevalence * prev$N_indiv)/prev$N_indiv

# Host.Species is nested within Social.system, but you didn't reuse 
# species letters between Social.systems, so you can specify 
# Host.Species as a random effect without explicitly nesting it

# First random effect model
prev1.glmer = glmer(seroexact ~ Pathogen + Social.System + (1|Host.Species),
                  family=binomial(link="logit"), weights=N_indiv, data = prev)
summary(prev1.glmer)

## Fixed effects:
# Intercept is pathogen A and social.system A.  
# The z-test of the intercept is testing if the logit=0
# I.e. it's testing whether the combination of
# pathogen A and social.system A has prob=0.5.
# The other z-tests are testing whether other levels of the factors
# yield different probabilities than pathogen A and social.system A

## Random effects:
# This doesn't give you separate Host.Species and residual variances,
# Host.Species is treated as a random effect, so this model is the same as if
# you had summed the results of all studies with identical values of
# Host.Species, Pathogen, and Social.System. I.e. sum the results of the
# first 8 rows and create a single proportion and N_indiv, like so:

prevsum<-aggregate(cbind(N_indiv, prop=(seroexact*N_indiv)) ~ 
                   Social.System+Host.Species+Pathogen, data=prev, sum)
prevsum$prop<-prevsum$prop/prevsum$N_indiv

# which gives the same model:
prevsum.glmer = glmer(prop ~ Pathogen + Social.System + (1|Host.Species),
                      family=binomial(link="logit"), weights=N_indiv, data = prevsum)
summary(prevsum.glmer)

# So why are they broken up into multiple rows?  If each row represents
# one geographic area/time/litter/study/etc. then animals in one row
# might be more similar to eachother than they are to animals in
# another row that has the same values of Social, Species, & Pathogen.
# I think this is what the advisor wants as a "residual".

# To allow a random component for each row:
prev2<-cbind(resid=paste("Row_", row.names(prev), sep=""), prev)

prev2.glmer = glmer(seroexact ~ Pathogen + Social.System + (1|Host.Species) + (1|resid),
                   family=binomial(link="logit"), weights=N_indiv, data = prev2)
summary(prev2.glmer)

# This isn't a bad start, but I'm not comfortable with it because:
table(prev2[,2:3])

# Social.Sytstem D is only observed in Species F.
# This is called confounding, and it makes it hard to draw conclusions
# about Social Sytstem D.  How do you know what is caused by social
# system D and what is caused by species F?  If your friend really wants to
# make inferences about Social System D, she should collect data from
# another host species that uses Social System D.

# Leave out Soc_D:
prev3.glmer = glmer(seroexact ~ Pathogen + Social.System + (1|Host.Species) + (1|resid),
                    family=binomial(link="logit"), weights=N_indiv, 
                    data = prev2[prev2$Social.System != "Soc_D",])
summary(prev3.glmer)

# Even though Host Species is conceptually a random factor, you really need to observe
# more than 2 species per social system for a mixed model to accurately estimate
# the species variance.  As far as species variance is concerned, each species is a
# single sample (not animals or even litters), and you can't hope to estimate variance
# accurately with only two samples.

# We can fit the model with species as a fixed effect, but we don't have
# enough degrees of freedom to estimate all levels of Species:
prev4.glmer = glmer(seroexact ~ Pathogen + Social.System + Host.Species + (1|resid),
                    family=binomial(link="logit"), weights=N_indiv, 
                    data = prev2[prev2$Social.System != "Soc_D",])

# Your friend doesn't need to estimate the level of each species in order to test
# whether species has any noticeable effect at all.  Unfortunately, we can't just
# Use the F statistic from anova() because calculating the denominator df for a
# GLMM is not straightforward.
anova(prev4.glmer) #Gives you an F statistic, but no denominator df or p-value

# Instead we fit a simpler model without Species:
prev5.glmer = glmer(seroexact ~ Pathogen + Social.System + (1|resid),
                    family=binomial(link="logit"), weights=N_indiv, 
                    data = prev2[prev2$Social.System != "Soc_D",])

# And we'll compare the two models With a Likelihood-Ratio test using anova()
anova(prev5.glmer,prev4.glmer)

# With a p-value of 0.01331 we can say it's worth keeping Species in the model.

# Now let's check the pathogen * social system interaction:
prev6.glmer = glmer(seroexact ~ Pathogen * Social.System + Host.Species + (1|resid),
                    family=binomial(link="logit"), weights=N_indiv, nAGQ=2, 
                    data = prev2[prev2$Social.System != "Soc_D",])
summary(prev6.glmer) #Neither interaction term is significant
anova(prev6.glmer)
# We don't need a denominator df to know that the F statistic of 0.0774 for
# the interaction is insignificant.

# Since the interaction between Pathogen and Social System was not significant,
# we don't need to include the interaction term.  Similarly, I don't see a 
# statistical reason to  split the model into two separate 'pathogen specific'
# models, but maybe there's a scientific reason to do so:

# Separate tests for each pathogen:
prev7A.glmer = glmer(seroexact ~ Social.System + Host.Species + (1|resid),
                    family=binomial(link="logit"), weights=N_indiv, 
                    data = prev2[prev2$Social.System != "Soc_D" & prev$Pathogen == "Path_A",])
summary(prev7A.glmer)
# Social System B looks different from Social System A in pathogen A prevalance:

# Calculate the odds of having Pathogen A for Social System A vs B
beta7A<-fixef(prev7A.glmer)
exp(-beta7A[2]) #negative sign means odds of A:B instead of B:A
# So animals with Social System A have about 25 times the odds of
# animals with social system B of having Pathogen A

# Test for Pathogen B:
prev7B.glmer = glmer(seroexact ~ Social.System + Host.Species + (1|resid),
                    family=binomial(link="logit"), weights=N_indiv, 
                    data = prev2[prev2$Social.System != "Soc_D" & prev$Pathogen == "Path_B",])
summary(prev7B.glmer)
# The only significant effects are species-specific, which are not of interest

# Let's return to prev4.glmer, which models both pathogens:
summary(prev4.glmer)

# The only significant fixed effect in prev4.glmer is Pathogen.
beta4<-fixef(prev4.glmer)

# For a randomly selected animal, the odds of having Pathogen B to having Pathogen A are:
exp(beta4[2])

# That's about as much as you can interpret with the data she has.

# To answer the Advisor's request for variance components:
# Residual variance is:
getME(prev4.glmer, "theta")^2

# You can't do a good job of estimating species variance with these data.
# If her advisor won't listen, then you can tell him that your estimate is:
getME(prev3.glmer, "theta")[2]^2
# But it's a really crappy estimate.

# There is no such thing as a variance component for Social System because
# it's a fixed effect.  But you can get its sum of squares:
anova(prev4.glmer)

Related Solutions

Solved – Nested ANOVA, three way ANOVA, mixed model, or ANCOVA

The crossed, non-nested model is incorrect, as you suspect, and seems to be over-specified for your posted data. It's important to get the nesting correct, which can be confusing. This page is a good guide. With multiple individuals per site and each individual restricted to one site, the individuals are nested within site. If you were to use your anova(lm()) approach you should be writing:

total_conc ~ tissue/site/indiv

which is expanded to:

total_conc ~ 1 + tissue + tissue:site + tissue:site:indiv

So this model examines the main effect of tissue plus its interactions with site and with site:indiv, without main effects for site or indiv. The result of your anova(lm()) on your posted data* would then be:

Analysis of Variance Table

Response: total_conc
                 Df     Sum Sq   Mean Sq  F value  Pr(>F)  
tissue            4 3023661191 755915298 1288.243 0.02089 *
tissue:site      12 1280847591 106737299  181.903 0.05788 .
tissue:site:indv 95 1325836886  13956178   23.784 0.16203  
Residuals         1     586780    586780

You only have 1 residual degree of freedom because most cases have only one measurement on each tissue from an individual. So the three-way interaction is fairly meaningless. Although results from the Type I sums of squares used by anova(), when applied to unbalanced data, can depend on the order of entry of variables into the model formula, in this case there really is only one order of entry that makes sense, so that's not such an issue.

These data are highly unbalanced, with observations on fruit restricted to 2 indiv at one particular site. If that's characteristic of your real data, then you probably should consider removing such tissues with low numbers of observations from your analyses. Would you (or a skeptical reviewer) really trust analyses based on 2 observations from a single site? Removing the fruit observations from these data, at least, would leave a reasonably balanced design that would probably be good enough for standard nested ANOVA.

Note that both the nested ANOVA approach and mixed models with intercept-only terms like (1|indiv) are making an implicit assumption that all effects of individuals and sites on total_conc are additive without regard to the tissue being evaluated. You must use your knowledge of the subject matter as to the validity of that assumption. With the raw means of total_conc ranging from 2804 (pollen) to 16844 (fruit) I would be a bit worried about that assumption. In principle the formalism of a mixed model could incorporate random effects that differ among tissue values, but you don't seem to have enough data for that. The suggestion from @MarkWhite to try a Bayesian multilevel model might help, but I have no experience with such models.

If there is some irreducible imbalance in the data then I think that many would argue for a mixed-model approach. My guess is that the errors that you discover with mixed models due to small numbers of observations would also be seen in the anova(lm()) approach if you dug more deeply. For example, that approach on the crossed model applied to your data provides what seems to be a perfectly reasonable ANOVA table.

> anova(lm(total_conc ~ tissue+site+indv, data=data))
Analysis of Variance Table

Response: total_conc
          Df     Sum Sq   Mean Sq F value    Pr(>F)    
tissue     4 3023661191 755915298  47.967 < 2.2e-16 ***
site       3  746832764 248944255  15.797 3.552e-08 ***
indv      24  583953341  24331389   1.544   0.07733 .  
Residuals 81 1276485152  15759076

Nevertheless, the underlying crossed model can't define 3 regression coefficients because of singularities:

> summary(lm(total_conc ~ tissue+site+indv, data=data))

Call:
lm(formula = total_conc ~ tissue + site + indv, data = data)

Residuals:
    Min      1Q  Median      3Q     Max 
-9199.3 -2226.1   -84.8  1890.3  9891.5 

Coefficients: (3 not defined because of singularities)

*Note that the indiv values in the posted data need to be converted to factors first.

Best Answer

Related Solutions

Solved – Nested ANOVA, three way ANOVA, mixed model, or ANCOVA

Related Question