R Mixed Model Selection – Questions on Specifying Linear Mixed Models in R for Repeated Measures with Additional Nesting

lme4-nlme, mixed model, model selection, r, repeated measures

Data Structure

> str(data)
 'data.frame':   6138 obs. of  10 variables:
 $ RT     : int  484 391 422 516 563 531 406 500 516 578 ...
 $ ASCORE : num  5.1 4 3.8 2.6 2.7 6.5 4.9 2.9 2.6 7.2 ...
 $ HSCORE : num  6 2.1 7.9 1 6.9 8.9 8.2 3.6 1.7 8.6 ...
 $ MVMNT  : Factor w/ 2 levels "_Withd","Appr": 2 2 1 1 2 1 2 1 1 2 ...
 $ STIM   : Factor w/ 123 levels " arti"," cele",..: 16 23 82 42 105 4 93 9 34 25 ...
 $ DRUG   : Factor w/ 2 levels "Inactive","Pharm": 1 1 1 1 1 1 1 1 1 1 ...
 $ FULLNSS: Factor w/ 2 levels "Fasted","Fed": 2 2 2 2 2 2 2 2 2 2 ...
 $ PATIENT: Factor w/ 25 levels "Subj01","Subj02",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ SESSION: Factor w/ 4 levels "Sess1","Sess2",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ TRIAL  : Factor w/ 6138 levels "T0001","T0002",..: 1 2 3 4 5 6 7 8 9 10 ...

Full Model Candidate

model.loaded.fit <- lmer(RT ~ ASCORE*HSCORE*MVMNT*DRUG*FULLNSS
                              + (1|PATIENT) + (1|SESSION), data, REML = TRUE)

Reaction times from trials are clustered within sessions, which in turn are clustered within patients
Each trial is characterized by two continuous covariates, ASCORE and HSCORE (each ranging from 1 to 9), and by a movement response (withdraw or approach)
Sessions are characterized by drug intake (placebo or active pharmacon) and by fullness (fasted or pre-fed)

Modeling and R Syntax?

I'm trying to specify an appropriate full model with a loaded mean structure that can be used as a starting point in a top-down model selection strategy.

Specific issues:

Is the syntax correctly specifying the clustering and random effects?
Beyond syntax, is this model appropriate for the above within-subject design?
Should the full model specify all interactions of fixed effects, or only the ones that I am really interested in?
I have not included the STIM factor in the model, which characterizes the specific stimulus type used in a trial, but which I am not interested in estimating in any way – should I specify that as a random factor given it has 123 levels and very few data points per stimulus type?

Best Answer

I will answer each of your queries in turn.

Is the syntax correctly specifying the clustering and random effects?

The model you've fit here is, in mathematical terms, the model

$$ Y_{ijk} = {\bf X}_{ijk} {\boldsymbol \beta} + \eta_{i} + \theta_{ij} + \varepsilon_{ijk}$$

where

$Y_{ijk}$ is the reaction time for observation $k$ during session $j$ on individual $i$.
${\bf X}_{ijk}$ is the predictor vector for observation $k$ during session $j$ on individual $i$ (in the model you've written up, this comprises all main effects and all interactions).
$\eta_i$ is the person $i$ random effect that induces correlation between observations made on the same person.
$\theta_{ij}$ is the random effect for individual $i$'s session $j$.
$\varepsilon_{ijk}$ is the leftover error term.
${\boldsymbol \beta}$ is the regression coefficient vector.

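As noted on page 14-15 here, writing the random effects as (1|PATIENT) + (1|SESSION) specifies that sessions are nested within individuals, which is the case from your description, provided that every session carries a label that is unique across patients. Your str() output shows SESSION with only 4 levels ("Sess1", "Sess2", ...) reused across the 25 patients; with labels like that, lme4 treats PATIENT and SESSION as crossed. In that case, (1|PATIENT/SESSION), or equivalently (1|PATIENT) + (1|PATIENT:SESSION), is what gives the $\theta_{ij}$ term in the model above.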
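A minimal sketch of the labelling point, using simulated data rather than the asker's data (the data frame `sim`, the effect sizes, and the seed are all assumptions made for illustration). It compares how many grouping levels lme4 sees under the two random-effects specifications when session labels are reused across patients:

```r
library(lme4)

set.seed(1)
# Simulated stand-in for the real data: 25 patients x 4 sessions x 20 trials.
# Session labels Sess1..Sess4 are reused by every patient, as in the str() output.
sim <- expand.grid(TRIAL   = 1:20,
                   SESSION = factor(paste0("Sess", 1:4)),
                   PATIENT = factor(sprintf("Subj%02d", 1:25)))

# One random shift per patient and one per patient-by-session cell.
patient_eff <- rnorm(25, sd = 40)[as.integer(sim$PATIENT)]
cell        <- interaction(sim$PATIENT, sim$SESSION, drop = TRUE)
session_eff <- rnorm(nlevels(cell), sd = 20)[as.integer(cell)]
sim$RT      <- 500 + patient_eff + session_eff + rnorm(nrow(sim), sd = 50)

# Reused labels: SESSION is treated as crossed with PATIENT (only 4 session levels).
fit_crossed <- lmer(RT ~ 1 + (1 | PATIENT) + (1 | SESSION), data = sim)

# Explicit nesting: one session effect per patient-by-session combination.
fit_nested <- lmer(RT ~ 1 + (1 | PATIENT / SESSION), data = sim)

# Number of levels per grouping factor, sorted for easy comparison.
n_crossed <- sort(unname(ngrps(fit_crossed)))
n_nested  <- sort(unname(ngrps(fit_nested)))
print(n_crossed)
print(n_nested)
```

The crossed fit may report a singular fit on data simulated this way, since no variance was generated at the level of the four shared session labels; that is expected and does not affect the count of grouping levels.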

Beyond syntax, is this model appropriate for the above within-subject design?

I think this model is reasonable, as it does respect the nesting structure in the data and I do think that individual and session are reasonably envisioned as random effects, as this model asserts. You should look at the relationships between the predictors and the response with scatterplots, etc. to ensure that the linear predictor (${\bf X}_{ijk} {\boldsymbol \beta}$) is correctly specified. The other standard regression diagnostics should possibly be examined as well.

Should the full model specify all interactions of fixed effects, or only the ones that I am really interested in?

I think starting with such a heavily saturated model may not be a great idea, unless it makes sense substantively. As I said in a comment, this will tend to overfit your particular data set and may make your results less generalizable. Regarding model selection, if you do start with the completely saturated model and do backwards selection (which some people on this site, with good reason, object to), then you have to make sure to respect the hierarchy in the model. That is, if you eliminate a lower-order term from the model, then you should also delete all higher-order interactions that contain it. For more discussion on that, see the linked thread.
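To make the size of the saturated model and the hierarchy constraint concrete, here is a small sketch in base R that only inspects formulas (no data or fitting needed). The two-way restriction via `^2` is just one illustrative alternative, not a recommendation from the answer; `drop.scope()` from the stats package lists the terms that can be dropped individually without leaving a higher-order interaction without its margins.

```r
# Saturated fixed-effects structure from the question: every main effect
# and every interaction up to the five-way term.
full <- terms(RT ~ ASCORE * HSCORE * MVMNT * DRUG * FULLNSS)

# Illustrative alternative: main effects plus two-way interactions only.
two_way <- terms(RT ~ (ASCORE + HSCORE + MVMNT + DRUG + FULLNSS)^2)

n_full    <- length(attr(full, "term.labels"))     # 2^5 - 1 = 31 terms
n_two_way <- length(attr(two_way, "term.labels"))  # 5 + 10 = 15 terms

# Terms that may be removed next while respecting the model hierarchy.
droppable_full    <- drop.scope(full)     # only the five-way interaction
droppable_two_way <- drop.scope(two_way)  # the two-way interactions

print(c(full = n_full, two_way = n_two_way))
print(droppable_full)
print(droppable_two_way)
```

In a backward-selection pass on the saturated model, the five-way interaction is therefore the only legitimate first candidate for removal; main effects and lower-order interactions become candidates only after every interaction containing them is gone.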

I have not included the STIM factor in the model, which characterizes the specific stimulus type used in a trial, but which I am not interested in estimating in any way - should I specify that as a random factor given it has 123 levels and very few data points per stimulus type?

Admittedly, not knowing anything about the application (so take this with a grain of salt), that sounds like a fixed effect, not a random effect. That is, the stimulus type sounds like a variable that would correspond to a fixed shift in the mean response, not something that would induce correlation between subjects who had the same stimulus type. However, the fact that it is a 123-level factor makes it cumbersome to enter into the model. I suppose I'd want to know how large an effect you'd expect this to have. Regardless of the size of the effect, it will not induce bias in your slope estimates since this is a linear model, but leaving it out may make your standard errors larger than they would otherwise be.

Related Solutions

Solved – Specifying a linear mixed model in lmer with replications nested within a fully crossed design

Since each subj_id is associated with one and only one value of sex, you need not mention sex in the random part of the model formula. The model using maximal random effects would be:

(1 + method*param1*param2 | subj_id)

You might have some difficulty estimating the random effects for this model because of the large number of levels of your factors (which implies a very large correlation matrix).
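A rough sketch of why the maximal structure gets hard to estimate: with q random-effect columns per subject, the random-effects covariance matrix has q(q+1)/2 free parameters. The level counts below (3, 4 and 4) are purely hypothetical, since the original question's design is not shown here.

```r
# Hypothetical level counts for the three within-subject factors.
design <- expand.grid(method = factor(1:3),
                      param1 = factor(1:4),
                      param2 = factor(1:4))

# Columns of the per-subject random-effects design implied by
# (1 + method*param1*param2 | subj_id).
Z <- model.matrix(~ 1 + method * param1 * param2, data = design)
q <- ncol(Z)                  # 3 * 4 * 4 = 48 columns

# Variances plus covariances in an unstructured q x q covariance matrix.
n_cov_params <- q * (q + 1) / 2

print(c(columns = q, covariance_parameters = n_cov_params))
```

Even with these modest assumed level counts, the model asks for over a thousand variance and covariance parameters, which is usually far more than the data can support.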

Also, I can't think of any reason why using difference scores instead of the raw pairs of scores as your DV would cause a problem.

Solved – Repeated-measures linear mixed effect model

Okay, it should work out okay then. So:

1. Yes, or you can use lmer() from lme4. There is another one, which I think is lme() (from the nlme package).

2. You have a nested structure, so yes, you need (1|sample/participant).

Did you plot rating score vs. stim.level to see evidence of a quadratic relationship? If not, try plotting to see. If you do see a quadratic pattern, then yes, you should add a quadratic effect of stim.level with

model <- lmer(rating.score ~ stim.level + I(stim.level^2) + factor + stim.level*factor + (1|sample/participant), data = mydata)

To reply to the comment

If you are fitting a parabola and not a line, you are fitting the generic

 y= a + b*x + c*x^2

So you need both the linear and the quadratic term: stim.level is the linear term and I(stim.level^2) is the quadratic term. Try writing your model out on paper in equation form, like

rating score = b_0 + b_1 * stim.level + b_2 * stim.level^2 + b_3 * (stim.level*factor)

If you want, you could also have an interaction with the squared term, but that might not be directly interpretable. 2) Remember you are fitting a parabola, NOT a line: something that looks like http://biopt.ub.edu/_/rsrc/1257372769717/force-detection/equipartition-theorem/ex4%20X%20potential.jpg?height=315&width=420. Did you plot stim.level against rating.score and see a quadratic relationship?
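As a small illustration of the linear-plus-quadratic specification, here is a sketch on simulated data using plain lm(), leaving out the random effects and the factor interaction to keep it short. The true coefficients (2, 3, -0.5), the noise level, and the seed are assumptions chosen for the example.

```r
set.seed(42)
# Simulated ratings following y = a + b*x + c*x^2 plus a little noise.
stim.level   <- rep(1:10, times = 20)
rating.score <- 2 + 3 * stim.level - 0.5 * stim.level^2 +
  rnorm(length(stim.level), sd = 0.1)
mydata <- data.frame(stim.level, rating.score)

# Plot first: curvature in this scatterplot is the evidence for a quadratic term.
plot(rating.score ~ stim.level, data = mydata)

# I() protects ^ so it means "square the variable" rather than formula crossing.
fit <- lm(rating.score ~ stim.level + I(stim.level^2), data = mydata)
est <- coef(fit)
print(round(est, 2))
```

The same two terms carry over unchanged into the lmer() formula; only the random-effects part is added on top.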