R Mixed Model Selection – Questions on Specifying Linear Mixed Models in R for Repeated Measures with Additional Nesting

lme4-nlmemixed modelmodel selectionrrepeated measures

Data Structure

> str(data)
 'data.frame':   6138 obs. of  10 variables:
 $ RT     : int  484 391 422 516 563 531 406 500 516 578 ...
 $ ASCORE : num  5.1 4 3.8 2.6 2.7 6.5 4.9 2.9 2.6 7.2 ...
 $ HSCORE : num  6 2.1 7.9 1 6.9 8.9 8.2 3.6 1.7 8.6 ...
 $ MVMNT  : Factor w/ 2 levels "_Withd","Appr": 2 2 1 1 2 1 2 1 1 2 ...
 $ STIM   : Factor w/ 123 levels " arti"," cele",..: 16 23 82 42 105 4 93 9 34 25 ...
 $ DRUG   : Factor w/ 2 levels "Inactive","Pharm": 1 1 1 1 1 1 1 1 1 1 ...
 $ FULLNSS: Factor w/ 2 levels "Fasted","Fed": 2 2 2 2 2 2 2 2 2 2 ...
 $ PATIENT: Factor w/ 25 levels "Subj01","Subj02",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ SESSION: Factor w/ 4 levels "Sess1","Sess2",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ TRIAL  : Factor w/ 6138 levels "T0001","T0002",..: 1 2 3 4 5 6 7 8 9 10 ...

Full Model Candidate

model.loaded.fit <- lmer(RT ~ ASCORE*HSCORE*MVMNT*DRUG*FULLNSS
                              + (1|PATIENT) + (1|SESSION), data, REML = TRUE)
  • Reaction times from trials are clustered within sessions, which in turn are clustered within patients
  • Each trial can be characterized by two continuous covariates of ASCORE and HSCORE (ranging between 1-9) and by a movement response (withdraw or approach)
  • Sessions are characterized by drug intake (placebo or active pharmacon) and by fullness (fasted or pre-fed)

Modeling and R Syntax?

I'm trying to specify an appropriate full model with a loaded mean structure that can be used as a starting point in a top-down model selection strategy.

Specific issues:

  • Is the syntax correctly specifying the clustering and random effects?
  • Beyond syntax, is this model appropriate for the above within-subject design?
  • Should the full model specify all interactions of fixed effects, or only the ones that I am really interested in?
  • I have not included the STIM factor in the model, which characterizes the specific stimulus type used in a trial, but which I am not interested to estimate in any way – should I specify that as a random factor given it has 123 levels and very few data points per stimulus type?

Best Answer

I will answer each of your queries in turn.

Is the syntax correctly specifying the clustering and random effects?

The model you've fit here is, in mathematical terms, the model

$$ Y_{ijk} = {\bf X}_{ijk} {\boldsymbol \beta} + \eta_{i} + \theta_{ij} + \varepsilon_{ijk}$$

where

  • $Y_{ijk}$ is the reaction time for observation $k$ during session $j$ on individual $i$.

  • ${\bf X}_{ijk}$ is the predictor vector for observation $k$ during session $j$ on individual $i$ (in the model you've written up, this is comprised of all main effects and all interactions).

  • $\eta_i$ is the person $i$ random effect that induces correlation between observations made on the same person. $\theta_{ij}$ is the random effect for individual $i$'s session $j$ and $\varepsilon_{ijk}$ is the leftover error term.

  • ${\boldsymbol \beta}$ is the regression coefficient vector.

As noted on page 14-15 here this model is correct for specifying that sessions are nested within individuals, which is the case from your description.

Beyond syntax, is this model appropriate for the above within-subject design?

I think this model is reasonable, as it does respect the nesting structure in the data and I do think that individual and session are reasonably envisioned as random effects, as this model asserts. You should look at the relationships between the predictors and the response with scatterplots, etc. to ensure that the linear predictor (${\bf X}_{ijk} {\boldsymbol \beta}$) is correctly specified. The other standard regression diagnostics should possibly be examined as well.

Should the full model specify all interactions of fixed effects, or only the ones that I am really interested in?

I think starting with such a heavily saturated model may not be a great idea, unless it makes sense substantively. As I said in a comment, this will tend to overfit your particular data set and may make your results less generalizable. Regarding model selection, if you do start with the completely saturated model and do backwards selection (which some people on this site, with good reason, object to) then you have to make sure to respect the hierarchy in the model. That is, if you eliminate a lower level interaction from the model, then you should also delete all higher level interactions involving that variable. For more discussion on that, see the linked thread.

I have not included the STIM factor in the model, which characterizes the specific stimulus type used in a trial, but which I am not interested to estimate in any way - should I specify that as a random factor given it has 123 levels and very few data points per stimulus type?

Admittedly not knowing anything about the application (so take this with a grain of salt), that sounds like a fixed effect, not a random effect. That is, the treatment type sounds like a variable that would correspond to a fixed shift in the mean response, not something that would induce correlation between subjects who had the same stimulus type. But, the fact that it's a 123 level factor makes it cumbersome to enter into the model. I suppose I'd want to know how large of an effect you'd expect this to have. Regardless of the size of the effect, it will not induce bias in your slope estimates since this is a linear model, but leaving it out may make your standard errors larger than they would otherwise be.

Related Question