I'm new to longitudinal analyses, and I'm having trouble formulating a model that accurately reflects my study design. This study recruited subjects for two groups (dx vs. control), with measurements taken for each subject at three different timepoints. Age at baseline varied from subject to subject; from what I've read, this means the design is "unbalanced."
My data frame is organized such that if row n reads:
[subject_ID = x, baseline_age = y, Group = 1, Timepoint = 1, DV = k]
then row n+1 reads:
[subject_ID = x, baseline_age = y, Group = 1, Timepoint = 2, DV = m]
I'm interested in relationships between baseline_age, timepoint, and Group. If baseline_age weren't a factor, I think the R code would be as follows:
mod1 <- lmer(DV ~ timepoint + Group + timepoint:Group + (timepoint | subject_ID),
             data = mydat)
where (timepoint | subject_ID) reflects the fact that timepoint varies at the individual level. However, assuming the above is correct, my confusion arises when I try to model random effects with baseline_age entered into the equation. Since baseline_age and subject_ID are perfectly correlated, would it be possible to use baseline_age as a proxy for subject_ID in lmer? Or should I model a second random effect? Specifically, I'm considering the following three-way interaction model:
mod2 <- lmer(DV ~ timepoint + Group + baseline_age + timepoint:Group +
               timepoint:baseline_age + Group:baseline_age +
               timepoint:Group:baseline_age + (timepoint | baseline_age),
             data = mydat)
Best Answer
Your second model does not make sense because observations are not grouped/clustered/nested within baseline_age. The first model does make sense because observations are grouped/nested/clustered within subject_ID (because you have repeated measures). There is no further clustering as far as I can gather from the description. So, a good initial model would include fixed effects for timepoint, Group, and baseline_age, plus a random intercept for subject_ID.
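In lme4 syntax, a random-intercept model with fixed effects for timepoint, Group, and baseline_age might be written as in the sketch below. The formula is reconstructed from the description of the coefficients in this answer. The simulated data frame is purely hypothetical and only stands in for the asker's mydat so that the example runs; the sample size and effect sizes are arbitrary assumptions.

```r
library(lme4)

# Hypothetical stand-in for the asker's data: 40 subjects, 3 timepoints each.
set.seed(1)
n_subj <- 40
mydat <- data.frame(
  subject_ID   = factor(rep(seq_len(n_subj), each = 3)),
  timepoint    = rep(1:3, times = n_subj),  # numeric coding
  Group        = factor(rep(rep(c("control", "dx"), each = n_subj / 2), each = 3)),
  baseline_age = rep(round(runif(n_subj, 20, 60)), each = 3)
)

# Simulated outcome: fixed effects plus a subject-specific intercept and noise.
subj_int <- rnorm(n_subj, sd = 2)
mydat$DV <- 10 + 0.5 * mydat$timepoint + 1.0 * (mydat$Group == "dx") +
  0.05 * mydat$baseline_age +
  subj_int[as.integer(mydat$subject_ID)] + rnorm(nrow(mydat))

# Random intercept per subject; baseline_age enters only as a fixed effect.
mod0 <- lmer(DV ~ timepoint + Group + baseline_age + (1 | subject_ID),
             data = mydat)
summary(mod0)
```

Here baseline_age is a subject-level covariate, so it belongs in the fixed part of the formula; the grouping factor on the right of | stays subject_ID.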
The coefficient for timepoint will provide the linear growth estimate(s) (whether it is coded as a factor or numeric matters, see below), the coefficient for Group will give the treatment effect while controlling for differential baseline ages, and the random intercept for subject_ID will allow each subject's intercept (at the study inception point) to vary. You could include interactions of the fixed effects at this stage, if this makes sense for your research question.

Subsequent to this, you might want to include one or more of the fixed-effects variables as random coefficients (slopes) if their effects are thought to vary between subjects, such as timepoint, which you mention. Note that this is not the same as your statement in the question: "where (timepoint | subject_ID) reflects the fact that timepoint varies at the individual level". Obviously timepoint varies at the individual level, because you have repeated measures. The random intercept deals with this, but it assumes that the slope (coefficient) for timepoint is the same for each subject. If you have reason to believe that the slopes should vary between subjects, then you can include one or more variables on the left side of the | symbol. However, if timepoint is a factor then this will result in a separate random effect for each level, which will increase the computational burden and possibly cause numerical problems (that is, assuming you have enough observations to make such a model identifiable to begin with), as well as making the model interpretation more complex.

Also note that if timepoint is a factor then you will get a separate fixed effect estimate for each level (relative to the reference level) too, which may or may not be what you want (in my experience it is not usually what you want unless the number of levels is small). So, if it isn't already, you might also want to consider coding timepoint as a numeric variable, which will then give you a single fixed main effect. This will model linear growth in your outcome/response, giving each subject their own intercept. If you also add timepoint as a random slope, then you can allow each subject to have their own slope. If you want to cater for non-linear growth then you could add a quadratic term for timepoint (centering timepoint first to reduce collinearity between the linear and quadratic terms).
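The random-slope and quadratic extensions described above could be sketched as follows, assuming timepoint is coded numerically. The simulated data, the variable time_c, and the model names mod_slope and mod_quad are hypothetical and exist only to make the example self-contained; with just three timepoints per subject, a real fit may well be singular or fail to converge.

```r
library(lme4)

# Hypothetical data: 60 subjects, 3 timepoints, subject-specific intercepts and slopes.
set.seed(2)
n_subj <- 60
mydat <- data.frame(
  subject_ID   = factor(rep(seq_len(n_subj), each = 3)),
  timepoint    = rep(1:3, times = n_subj),
  Group        = factor(rep(rep(c("control", "dx"), each = n_subj / 2), each = 3)),
  baseline_age = rep(round(runif(n_subj, 20, 60)), each = 3)
)
id <- as.integer(mydat$subject_ID)
u0 <- rnorm(n_subj, sd = 2)    # subject-specific intercept deviations
u1 <- rnorm(n_subj, sd = 0.7)  # subject-specific slope deviations
mydat$DV <- 10 + u0[id] + (0.5 + u1[id]) * mydat$timepoint +
  1.0 * (mydat$Group == "dx") + 0.05 * mydat$baseline_age +
  rnorm(nrow(mydat), sd = 0.5)

# Centre timepoint before squaring it.
mydat$time_c <- mydat$timepoint - mean(mydat$timepoint)

# Random intercept and random slope for numeric timepoint, within subject.
mod_slope <- lmer(DV ~ time_c + Group + baseline_age + (time_c | subject_ID),
                  data = mydat)

# Same model with an added quadratic fixed effect for non-linear growth.
mod_quad <- lmer(DV ~ time_c + I(time_c^2) + Group + baseline_age +
                   (time_c | subject_ID),
                 data = mydat)

fixef(mod_quad)
```

With equally spaced timepoints 1, 2, 3, the centered variable takes the values -1, 0, 1, so the linear and quadratic terms are uncorrelated in this balanced sketch.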