Unbalanced linear mixed effect modeling for longitudinal data with lme4

Tags: lme4-nlme, mixed model, panel data, r, repeated measures

I'm new to longitudinal analyses, and I'm having trouble formulating a model that accurately reflects my study design. This study recruited subjects for two groups (dx vs. control), with measurements taken for each subject at three different timepoints. Age at baseline varied from subject to subject; from what I've read, this means the design is "unbalanced."

My data frame is organized such that if row n reads:

[subject_ID = x, baseline_age = y, Group = 1, timepoint = 1, DV = k]

then row n+1 reads:

[subject_ID = x, baseline_age = y, Group = 1, timepoint = 2, DV = m]

I'm interested in relationships between baseline_age, timepoint, and Group. If baseline_age weren't a factor, I think the R code would be as follows:

mod1 <- lmer(DV ~ timepoint + Group + timepoint:Group + (timepoint|subject_ID), 
             data = mydat) 

where (timepoint | subject_ID) reflects the fact that timepoint varies at the individual level. However, assuming the above is correct, my confusion arises when I try to model random effects with baseline_age entered into the equation. Since baseline_age and subject_ID are perfectly correlated, would it be possible to use baseline_age as proxy for subject_ID in lmer? Or should I model a second random effect? Specifically, I'm considering the following three-way interaction model:

mod2 <- lmer(DV ~ timepoint + Group + baseline_age + timepoint:Group + 
                  timepoint:baseline_age + Group:baseline_age + 
                  timepoint:Group:baseline_age + (timepoint | baseline_age), 
             data = mydat)

Best Answer

Your second model does not make sense because observations are not grouped/clustered/nested within baseline_age.

The first model does make sense because observations are grouped/nested/clustered within subject_ID (because you have repeated measures). There is no further clustering as far as I can gather from the description.
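One way to see this grouping structure in the data is to count rows per subject and confirm that baseline_age takes a single value within each subject. The following is only a sketch on a made-up data frame: the column names come from the question, but every value is invented for illustration.

```r
# Toy data in the long format described in the question (all values invented)
mydat <- data.frame(
  subject_ID   = rep(c("s1", "s2", "s3"), each = 3),
  baseline_age = rep(c(24, 31, 27), each = 3),
  Group        = rep(c(1, 0, 1), each = 3),
  timepoint    = rep(1:3, times = 3),
  DV           = c(5.1, 5.9, 6.8, 4.2, 4.4, 4.9, 6.0, 6.5, 7.7)
)

# Repeated measures: several rows per subject, so observations are
# clustered within subject_ID
rows_per_subject <- table(mydat$subject_ID)

# Number of distinct baseline_age values within each subject; a value of 1
# means baseline_age is a subject-level covariate, not a grouping factor
ages_per_subject <- tapply(mydat$baseline_age, mydat$subject_ID,
                           function(x) length(unique(x)))

print(rows_per_subject)
print(ages_per_subject)
```

If two subjects happened to share the same baseline age, grouping by baseline_age would pool their observations into one cluster, which is not what the design implies.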

So, a good initial model would be

mod2 <- lmer(DV ~ timepoint + Group + baseline_age + (1 | subject_ID), 
         data = mydat)

The coefficient for timepoint gives the linear growth estimate(s) (whether it is coded as a factor or as numeric matters, see below). The coefficient for Group gives the treatment effect, controlling for differences in baseline age. The random intercept for subject_ID allows each subject's intercept (at the study inception point) to vary. You could also include interactions among the fixed effects at this stage, if that makes sense for your research question.
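If interactions are wanted, R's formula shorthand `*` expands to the main effects plus all of their interactions, so the long three-way specification in the question does not have to be typed out term by term. A small sketch using only base R to show the expansion (the object names `fixed_part` and `term_labels` are just illustrative):

```r
# Fixed-effects part only; `*` expands to main effects plus all interactions
fixed_part  <- DV ~ timepoint * Group * baseline_age
term_labels <- attr(terms(fixed_part), "term.labels")

# 3 main effects, 3 two-way interactions and 1 three-way interaction
print(term_labels)

# With lme4, the same shorthand can be combined with the random intercept:
# lmer(DV ~ timepoint * Group * baseline_age + (1 | subject_ID), data = mydat)
```

Whether the full three-way interaction is worth estimating depends on the research question and on how much data is available.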

After this, you might want to include one or more of the fixed-effect variables as random coefficients (slopes) if their effects are thought to vary between subjects, such as timepoint, which you mention. Note that this is not the same as your statement in the question: "where (timepoint | subject_ID) reflects the fact that timepoint varies at the individual level". Obviously timepoint varies at the individual level, because you have repeated measures. The random intercept deals with this, but it assumes that the slope (coefficient) for timepoint is the same for every subject. If you have reason to believe that the slopes vary between subjects, you can include one or more variables on the left side of the | symbol. However, if timepoint is a factor, this will result in a separate random effect for each level, which will increase the computational burden and possibly cause numerical problems (assuming you have enough observations to make such a model identifiable to begin with), as well as making the model harder to interpret.
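The difference between the random-intercept model and the random-slope model can be illustrated on simulated data. This is a sketch under stated assumptions: the data frame `sim`, the sample size and all parameter values are invented, timepoint is coded numerically as 0, 1, 2, and the lme4 package is assumed to be installed.

```r
library(lme4)

set.seed(1)
n_subj <- 60

# Long format: three rows per subject, timepoint varies fastest
sim <- expand.grid(timepoint = 0:2, subject_ID = factor(seq_len(n_subj)))
s   <- as.integer(sim$subject_ID)
sim$Group        <- s %% 2
sim$baseline_age <- rep(round(runif(n_subj, 20, 40)), each = 3)

# Invented subject-specific deviations in intercept (u0) and slope (u1)
u0 <- rnorm(n_subj, sd = 1)
u1 <- rnorm(n_subj, sd = 0.5)
sim$DV <- 10 + 0.05 * sim$baseline_age + 0.5 * sim$Group +
  (1 + u1[s]) * sim$timepoint + u0[s] + rnorm(nrow(sim), sd = 0.5)

# Random intercept only: every subject shares the same timepoint slope
m_int <- lmer(DV ~ timepoint + Group + baseline_age + (1 | subject_ID),
              data = sim)

# Random intercept and random slope: each subject gets their own slope
m_slope <- lmer(DV ~ timepoint + Group + baseline_age +
                  (timepoint | subject_ID), data = sim)

# Likelihood-ratio comparison of the two nested models
print(anova(m_int, m_slope))
```

With only three measurements per subject, coding timepoint as a factor inside the random term would ask for three random effects per subject, as many as there are observations per subject, which is the identifiability concern raised above.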

Also note that if timepoint is a factor then you will get a separate fixed-effect estimate for each non-reference level too, which may or may not be what you want (in my experience it is not usually what you want unless the number of levels is small). So, if it isn't already, you might also want to consider coding timepoint as a numeric variable, which gives you a single fixed main effect. This models linear growth in your outcome/response, with the random intercept giving each subject their own intercept. If you also add timepoint as a random slope, each subject can have their own slope. To allow for non-linear growth, you could add a quadratic term for timepoint (centering it first to reduce collinearity).
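The effect of the coding choice, and of centering before squaring, can be checked with base R alone. A minimal sketch, assuming three equally spaced timepoints coded 1, 2, 3 (the vector below is invented and stands in for the timepoint column):

```r
# Stand-in for the timepoint column: 4 subjects, 3 timepoints each
timepoint <- rep(1:3, times = 4)

# Factor coding: intercept plus one dummy column per non-reference level
X_factor <- model.matrix(~ factor(timepoint))

# Numeric coding: intercept plus a single slope column
X_numeric <- model.matrix(~ timepoint)

print(colnames(X_factor))
print(colnames(X_numeric))

# Quadratic growth: compare the correlation between the linear and squared
# terms before and after centering
t_c          <- timepoint - mean(timepoint)
cor_raw      <- cor(timepoint, timepoint^2)
cor_centered <- cor(t_c, t_c^2)

print(c(raw = cor_raw, centered = cor_centered))

# In a model formula the centered terms could then enter as, for example,
# DV ~ t_c + I(t_c^2) + Group + baseline_age + (1 | subject_ID)
```

For this balanced, equally spaced design the centered linear and squared terms are uncorrelated; with unequal spacing or missing visits, centering will usually reduce the correlation rather than remove it.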