Mixed Effect Model – When and How to Use It


Linear mixed-effects models are extensions of linear regression models for data that are collected and summarized in groups. Their key advantage is that the coefficients can vary with respect to one or more grouping variables.

However, I am struggling to decide when to use a mixed-effects model. I will elaborate on my question with a toy example involving extreme cases.

Let's assume we want to model height and weight for animals, and we use species as the grouping variable.

If the groups/species are really different, say a dog and an elephant, I think there is no point in using a mixed-effects model; we should build a separate model for each group.
If the groups/species are really similar, say a female dog and a male dog, I think we may want to use gender as a categorical variable in the model.

So, I assume we should use a mixed-effects model in the cases in between? Say the groups are cat, dog, and rabbit: similar-sized animals, but different species.

Is there any formal argument that suggests when to use a mixed-effects model, i.e., how to draw the lines among the following?

1. Building models for each group
2. Mixed effect model
3. Use group as a categorical variable in regression

My attempt: Method 1 is the most "complex" model (fewer degrees of freedom), method 3 is the "simplest" model (more degrees of freedom), and the mixed-effects model is in the middle. We may consider how much data we have and how complicated it is in order to select the right model according to the bias-variance trade-off.
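To make the three options concrete, here is a minimal R sketch that fits each of them to simulated data. The data frame `animals`, its columns, and every numeric value are assumptions made up for illustration; the sketch uses `lme4` for the mixed model and is not meant as a recommendation of one approach over another.

```r
library(lme4)

set.seed(1)
# Hypothetical toy data: 3 species, 30 animals each (all values are made up)
species <- factor(rep(c("cat", "dog", "rabbit"), each = 30))
height  <- runif(90, 20, 60)
offset  <- c(0, 5, -2)[as.integer(species)]   # species-level shift in weight
weight  <- 1 + 0.2 * height + offset + rnorm(90, sd = 1)
animals <- data.frame(species, height, weight)

# 1. A separate regression for each group (no information shared)
fit_separate <- lapply(split(animals, animals$species),
                       function(g) lm(weight ~ height, data = g))

# 2. Mixed model: species intercepts treated as draws from a common distribution
fit_mixed <- lmer(weight ~ height + (1 | species), data = animals)

# 3. Species as a categorical (fixed) predictor
fit_fixed <- lm(weight ~ height + species, data = animals)

sapply(fit_separate, coef)   # one intercept and one slope per species
fixef(fit_mixed)             # population-level intercept and slope
ranef(fit_mixed)$species     # species-level deviations from that intercept
coef(fit_fixed)              # shared slope plus species offsets
```

With only three species the variance of the random intercepts is estimated from very few groups, which is one practical consideration when choosing between options 2 and 3.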

Best Answer

I'm afraid I might have the nuanced and perhaps unsatisfying answer that it is a subjective choice by the researcher or data analyst. As mentioned elsewhere in this thread, it isn't enough to simply say the data have a "nested structure." To be fair, though, this is how many books describe when to use multilevel models. For example, I just pulled Joop Hox's book Multilevel Analysis off of my bookshelf, which gives this definition:

A multilevel problem concerns a population with a hierarchical structure.

Even in a pretty good textbook, the initial definition seems to be circular. I think this is partially due to the subjectivity of determining when to use what kind of model (including a multilevel model).

Another book, West, Welch, & Galecki's Linear Mixed Models, says these models are for:

outcome variables in which the residuals are normally distributed but may not be independent or have constant variance. Study designs leading to data sets that may be appropriately analyzed using LMMs include (1) studies with clustered data, such as students in classrooms, or experimental designs with random blocks, such as batches of raw material for an industrial process, and (2) longitudinal or repeated-measures studies, in which subjects are measured repeatedly over time or under different conditions.

Finch, Bolin, & Kelley's Multilevel Modeling in R also talks about violating the iid assumption and correlated residuals:

Of particular importance in the context of multilevel modeling is the assumption [in standard regression] of independently distributed error terms for the individual observations within a sample. This assumption essentially means that there are no relationships among individuals in the sample for the dependent variable once the independent variables in the analysis are accounted for.

I believe that a multilevel model makes sense when there is reason to believe that observations are not necessarily independent of one another. Whatever "cluster" accounts for this non-independence can be modeled.

An obvious example would be children in classrooms—they are all interacting with one another, which might lead their test scores to be non-independent. What if one classroom has someone that asks a question that leads to material being covered in that class that isn't covered in other classes? What if the teacher is more awake for some classes than others? In this case, there would be some non-independence of data; in multilevel words, we could expect some variance in the dependent variable to be due to the cluster (i.e., class).
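One common way to quantify "variance due to the cluster" is the intraclass correlation from an intercept-only multilevel model. The following is a rough sketch on simulated classroom data; the data frame `scores`, the number of classrooms, and all variances are assumptions chosen for illustration.

```r
library(lme4)

set.seed(42)
# Hypothetical data: 20 classrooms with 25 pupils each (all values are made up)
n_class <- 20; n_pupil <- 25
class <- factor(rep(seq_len(n_class), each = n_pupil))
class_effect <- rnorm(n_class, sd = 5)   # shared shift for everyone in a classroom
score <- 70 + class_effect[as.integer(class)] +
  rnorm(n_class * n_pupil, sd = 10)      # pupil-level noise
scores <- data.frame(class, score)

# Intercept-only model with a random intercept per classroom
fit <- lmer(score ~ 1 + (1 | class), data = scores)

# Variance components: between-classroom and residual (within-classroom)
vc <- as.data.frame(VarCorr(fit))
var_class <- vc$vcov[vc$grp == "class"]
var_resid <- vc$vcov[vc$grp == "Residual"]

# Intraclass correlation: share of outcome variance attributable to the cluster
icc <- var_class / (var_class + var_resid)
icc
```

An intraclass correlation near zero would suggest that classroom membership explains little of the variation in scores, while a larger value indicates the kind of non-independence described above.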

Your example of a dog versus an elephant depends on the independent and dependent variables of interest, I think. For example, let's say we are asking if there is an effect of caffeine on activity level. Animals from all over the zoo are randomly assigned to either get a caffeinated drink or a control drink.

If we are researchers interested in the overall effect of caffeine, rather than its effect in any particular species, we might specify a multilevel model. This model would be specified as:

activity ~ condition + (1+condition|species)

This is particularly helpful if there are a large number of species we are testing this hypothesis over. However, a researcher might be interested in the species-specific effects of caffeine. In that case, they could specify species as a fixed effect:

activity ~ condition + species + condition*species

This obviously is a problem if there are, say, 30 species, creating an unwieldy 2 x 30 design. However, you can get pretty creative with how one models these relationships.
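The two specifications can be compared on simulated data. This is only a sketch: the data frame `zoo`, the 30 species, the group sizes, and all effect sizes are invented for illustration. Note that `condition * species` in R expands to the same terms as `condition + species + condition*species`.

```r
library(lme4)

set.seed(7)
# Hypothetical zoo data: 30 species, 20 animals each (all values are made up)
n_species <- 30; n_animal <- 20
zoo <- data.frame(
  species   = factor(rep(paste0("sp", seq_len(n_species)), each = n_animal)),
  condition = factor(rep(c("control", "caffeine"),
                         times = n_species * n_animal / 2),
                     levels = c("control", "caffeine"))
)
sp     <- as.integer(zoo$species)
base   <- rnorm(n_species, mean = 10, sd = 2)     # species baseline activity
effect <- rnorm(n_species, mean = 1.5, sd = 0.5)  # species-specific caffeine effect
zoo$activity <- base[sp] + effect[sp] * (zoo$condition == "caffeine") +
  rnorm(nrow(zoo), sd = 1)

# Multilevel model: one average caffeine effect that varies by species
fit_mixed <- lmer(activity ~ condition + (1 + condition | species), data = zoo)

# Fixed-effects model: a separate intercept and caffeine effect for every species
fit_fixed <- lm(activity ~ condition * species, data = zoo)

length(fixef(fit_mixed))  # 2 fixed-effect coefficients
length(coef(fit_fixed))   # 60 coefficients for the 2 x 30 design
```

The mixed model summarizes the species-level variation with a few variance parameters, whereas the fixed-effects model estimates every species-by-condition cell separately.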

For example, some researchers are arguing for an even wider use of multilevel modeling. Gelman, Hill, & Yajima (2012) argue that multilevel modeling could be used as a correction for multiple comparisons—even in experimental research where the structure of the data is not obviously hierarchical in nature:

Harder problems arise when modeling multiple comparisons that have more structure. For example, suppose we have five outcome measures, three varieties of treatments, and subgroups classified by two sexes and four racial groups. We would not want to model this 2 × 3 × 4 × 5 structure as 120 exchangeable groups. Even in these more complex situations, we think multilevel modeling should and will eventually take the place of classical multiple comparisons procedures.

Problems can be modeled in various ways, and in ambiguous cases, multiple approaches might seem appealing. I think our job is to choose a reasonable, informed approach and do so transparently.

Related Solutions

Solved – linear regression vs linear mixed effect model coefficients

I don't know that I can give a rigorous theoretical explanation, but a picture may make things clearer:

The blue line is the OLS fit, the gray line is the population-level prediction for the mixed model. The individual lines are predicted lines (all equal slopes, randomly varying intercepts) for each ID.
Since there is some correlation between the mean values of X and Y for each group, some of the variability that would go into the slope is instead taken out by the random intercept term.
The apparently large difference in the intercepts is partly caused by extrapolation (the data starts at X=2, the intercept refers to the expected value at X=0).

d <- data.frame(ID=factor(rep(1:20,each=3)),
                Y=c(1,2,3,5,4,6,7,8,9,2,3,4,5,5,6,7,6,
                    8,3,4,2,1,2,
                    1,5,6,4,7,8,9,8,8,7,6,4,
                    2,4,5,6,6,7,5,3,4,2,1,2,
                    3,4,2,3,5,6,4,7,8,6,9,8,9),
                X=c(3,4,3,6,4,6,6,8,5.5,4,3,5.5,5,7,5.5,7,4.5,6,4,
                    3,4,2.5,4,3,6,6,6.5,7,8,7,7,5.5,6,6.5,4,4,3.5,
                    5,4,5.5,7,4.5,4.5,6,5.5,2,3,6,3,4.5,3,5,6,3,
                    7.5,7.5,5.5,6.5,7,6))

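library(lme4)
library(dplyr)
library(ggplot2); theme_set(theme_bw())

## ordinary least squares fit, ignoring the grouping by ID
lm1 <- lm(Y ~ X, data = d)
## mixed model: common slope, random intercept for each ID
lmer1 <- lmer(Y ~ X + (1 | ID), data = d)
ff <- fixef(lmer1)   # population-level intercept and slope

## get predictions (per-ID fitted values), keeping only the two
## endpoints of each ID's fitted line for plotting
pp <- d
pp$Y <- predict(lmer1)
pp <- pp %>%
    group_by(ID) %>%
    filter(Y %in% range(Y))

ggplot(d,aes(X,Y,colour=ID))+
    geom_point()+
    scale_colour_discrete(guide="none")+   # hide the ID legend
    geom_line(data=pp)+                    # per-ID predicted lines
    scale_x_continuous(limits=c(0,8))+     # extend the axis to X=0
    geom_smooth(method="lm",aes(group=1),fullrange=TRUE)+  # OLS fit
    geom_abline(slope=ff["X"],intercept=ff["(Intercept)"],
                colour="darkgray",lwd=1.5) # mixed-model population line
ggsave("CV161703.png")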
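
The same effect can be reproduced on simulated data where each ID's intercept is deliberately tied to its mean X. This is a sketch under made-up assumptions: the data frame `sim`, the group sizes, and all coefficients are hypothetical and unrelated to the data set above.

```r
library(lme4)

set.seed(161703)
# Hypothetical data: 20 IDs with 10 observations each (all values are made up)
n_id <- 20; n_obs <- 10
id     <- factor(rep(seq_len(n_id), each = n_obs))
mean_x <- rnorm(n_id, mean = 5, sd = 1.5)             # each ID's typical X
u      <- 1.0 * (mean_x - 5) + rnorm(n_id, sd = 0.5)  # intercept shift tied to mean X
g      <- as.integer(id)
x      <- mean_x[g] + rnorm(n_id * n_obs, sd = 1)
y      <- 2 + 0.5 * x + u[g] + rnorm(n_id * n_obs, sd = 0.5)  # within-ID slope is 0.5
sim    <- data.frame(id, x, y)

fit_ols   <- lm(y ~ x, data = sim)
fit_mixed <- lmer(y ~ x + (1 | id), data = sim)

# OLS blends within-ID and between-ID variation into a single slope; the mixed
# model attributes much of the between-ID part to the random intercepts
c(ols = unname(coef(fit_ols)["x"]), mixed = unname(fixef(fit_mixed)["x"]))
```

In this setup the OLS slope should come out steeper than the mixed-model slope, because IDs with larger mean X also have higher intercepts.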
