R Mixed Model – Specifying an LME Model with More Than One Within-Subjects Factor

lme4-nlme, mixed model, r, repeated measures

The data

Suppose we have a dataset d with two between-subject factors (i.e., groups), group and condition, and two within-subject factors (i.e., repeated-measures factors), topic and problem (I uploaded the data to pastebin, so everybody should be able to obtain it):

> d <- read.table(url("http://pastebin.com/raw.php?i=4hRFyaRj"), colClasses = c(rep("factor", 6), "numeric"))
> str(d)
'data.frame':   2928 obs. of  6 variables:
  $ code     : Factor w/ 183 levels "A03U","A08C",..: 1 1 1 1 1 1 1 1 1 1 ...
  $ group    : Factor w/ 2 levels "control","experimental": 2 2 2 2 2 2 2 2 2 2 ...
  $ condition: Factor w/ 3 levels "alternatives",..: 3 3 3 3 3 3 3 3 3 3 ...
  $ topic    : Factor w/ 4 levels "1","2","3","4": 1 1 1 1 2 2 2 2 3 3 ...
  $ problem  : Factor w/ 4 levels "AC","DA","MP",..: 3 4 1 2 3 4 1 2 3 4 ...
  $ mean     : num  94.5 94.5 86.5 84.5 80 46.5 73.5 43.5 51 39 ...

The data is from a behavioral experiment in which participants in six groups (2 levels of group times 3 levels of condition) worked on 16 tasks (4 different problems for each of 4 topics). Allocation of participants to group/condition was fully random. Presentation of tasks was random insofar as problem was blocked within topic (i.e., for each topic all problems were presented sequentially), but the order of problems and topics was random.
Update: The factor identifying the participant (in which topic and problem are nested) is code.
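
To make the between/within distinction concrete, here is a minimal sketch on a made-up stand-in for d. The participant codes, the level labels (c1..c3, p1..p4) and the number of participants are hypothetical; only the layout (2 x 3 between, 4 x 4 within) follows the description above.

```r
# Hypothetical stand-in for d: made-up participant codes and level labels,
# no response variable. Only the design layout mirrors the real data.
n_subj <- 12
subjects <- data.frame(
  code      = factor(sprintf("P%02d", seq_len(n_subj))),
  group     = factor(rep(c("control", "experimental"), each = n_subj / 2)),
  condition = factor(rep(c("c1", "c2", "c3"), times = n_subj / 3))
)
tasks <- expand.grid(problem = factor(paste0("p", 1:4)),
                     topic   = factor(1:4))
toy <- merge(subjects, tasks)  # no shared columns, so this is a cross join

# Between-subject factor: every participant appears in exactly one level.
one_level <- function(f) all(rowSums(table(toy$code, f) > 0) == 1)
between_ok <- one_level(toy$group) && one_level(toy$condition)

# Within-subject factors: every participant has each topic x problem cell once.
within_ok <- all(table(toy$code, toy$topic, toy$problem) == 1)

str(list(between_ok = between_ok, within_ok = within_ok, rows = nrow(toy)))
```

The same checks should carry over to the real data by using d in place of toy.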

The Problem

How can I fit this dataset using lme?
(Sidenote: I would also consider using lme4, but I am wary of not having p-values. If it offers something as easily digestible as p-values, I would consider lme4 an option.)

So far I managed to fit an lme model with only one within-subject factor, but not two (see below).

What I tried

I can fit an lme model if I have just one within-subject factor:

require(nlme)
 m1 <- lme(mean ~ condition*group*problem, random = ~1|code/problem, 
           data = d, subset = topic == "1")

anova(m1)
                        numDF denDF F-value p-value
(Intercept)                 1   531   12101  <.0001
condition                   2   177      31  <.0001
group                       1   177       2  0.2178
problem                     3   531      35  <.0001
condition:group             2   177       1  0.3672
condition:problem           6   531      24  <.0001
group:problem               3   531       1  0.2180
condition:group:problem     6   531       2  0.0281

This (especially the df) corresponds nicely with the results from a standard ANOVA (using ez):

require(ez)
ezANOVA(subset(d, topic == "1"), dv = .(mean), wid = .(code), between = .(condition, group), within = .(problem))$ANOVA

Warning: Data is unbalanced (unequal N per group). Make sure you specified a well-considered value for the type argument to ezANOVA().
                   Effect DFn DFd     F                             p p<.05     ges
2               condition   2 177 30.69 0.000000000003611248905859672     * 0.13079
3                   group   1 177  1.53 0.217821969825403999321267179       0.00374
5                 problem   3 531 34.85 0.000000000000000000014254103     * 0.10028
4         condition:group   2 177  1.01 0.367225806638525886782531416       0.00492
6       condition:problem   6 531 24.40 0.000000000000000000000000142     * 0.13503
7           group:problem   3 531  1.48 0.217959293081550348203379031       0.00472
8 condition:group:problem   6 531  2.38 0.028119961573665430004664856     * 0.01499

Trying to fit this data with two within-subject factors in lme fails (either the call itself fails or the dfs are wrong):

m2 <- lme(mean ~ condition*group*problem*topic, random = ~1|code/(problem*topic), data = d)
# fails: Error in getGroups.data.frame(dataMix, groups) : 
#  Invalid formula for groups

# this model takes some time (probably already an indicator that it is the wrong model)
# and produces wrong denominator df!
m3 <- lme(mean ~ condition*group*problem*topic, random = ~1|code/problem/topic, data = d)

# with both factors as ANOVA
m4 <- ezANOVA(d, dv = .(mean), wid = .(code), between = .(condition, group), within = .(problem, topic))

#effects are the same
all(row.names(anova(m3))[-1] == m4$ANOVA$Effect)

#denominator dfs are not:
anova(m3)$denDF[-1] == m4$ANOVA$DFd

# only for effects with topic:
row.names(anova(m3))[-1][!(anova(m3)$denDF[-1] == m4$ANOVA$DFd)]

UPDATE: Since the precise error structure (nesting) was somewhat unclear, here is the equivalent aov call (the "standard" model via aov), which matches the results from ezANOVA. The critical error term is Error(code/(problem*topic)):

m5 <- aov(mean ~ (condition*group*problem*topic) + Error(code/(problem*topic)), d)
summary(m5)
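
The error term Error(code/(problem*topic)) expands to the strata code, code:problem, code:topic and code:problem:topic. As a rough sketch, assuming the data are complete (2928 = 183 x 16 observations) and the textbook split-plot formulas apply, the denominator df to expect in each stratum can be worked out by hand. The variable names below are only illustrative.

```r
# Design sizes taken from str(d) above
n_subjects  <- 183
n_group     <- 2
n_condition <- 3
n_problem   <- 4
n_topic     <- 4

# Between-subject residual df: participants minus between-subject cells
df_between <- n_subjects - n_group * n_condition

# Each within-subject stratum multiplies that by the df of the within term
den_df <- c(
  "code (between effects)"      = df_between,
  "code:problem (problem)"      = (n_problem - 1) * df_between,
  "code:topic (topic)"          = (n_topic - 1) * df_between,
  "code:problem:topic (p x t)"  = (n_problem - 1) * (n_topic - 1) * df_between
)
den_df
```

The first two values (177 and 531) agree with the single-factor results shown earlier; any effect involving topic should get 531, and any effect involving problem:topic should get 1593.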

Best Answer

I found an answer to my question in this thread: Repeated measures ANOVA with lme in R for two within-subject factors (somehow this thread was already among my favorites; I must have forgotten about it). The specification is a little unwieldy.

m6 <- lme(mean ~ condition*group*problem*topic, 
   random = list(code=pdBlocked(list(~1, pdIdent(~problem-1), pdIdent(~topic-1)))), data = d)
anova(m6)

However, the denominator dfs are still wrong, as noted in the thread and apparent in comparisons between the ANOVA and lme dfs.

data.frame(effect = rownames(anova(m6)), denDf= anova(m6)$denDF)

m4$ANOVA[,c("Effect", "DFd")]

As long as there are no other ideas, I think I will need to do the analysis in lme4, for which I will need to post another question.

Related Solutions

Paired t-test – Special Case of Linear Mixed-Effect Modeling

The equivalence of the models can be observed by calculating the correlation between two observations from the same individual, as follows:

As in your notation, let $Y_{ij} = \mu + \alpha_i + \beta_j + \epsilon_{ij}$, where $\beta_j \sim N(0, \sigma_p^2)$ and $\epsilon_{ij} \sim N(0, \sigma^2)$. Then $Cov(y_{ik}, y_{jk}) = Cov(\mu + \alpha_i + \beta_k + \epsilon_{ik}, \mu + \alpha_j + \beta_k + \epsilon_{jk}) = Cov(\beta_k, \beta_k) = \sigma_p^2$, because all other terms are independent or fixed, and $Var(y_{ik}) = Var(y_{jk}) = \sigma_p^2 + \sigma^2$, so the correlation is $\sigma_p^2/(\sigma_p^2 + \sigma^2)$.
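
The derivation can be checked with a quick simulation. This is only a sketch: the values of sigma_p and sigma, the fixed effects (0 and 1) and the sample size are arbitrary choices, not taken from the question.

```r
set.seed(1)
n       <- 1e5   # number of simulated individuals (arbitrary)
sigma_p <- 0.5   # SD of the per-individual random effect beta (arbitrary)
sigma   <- 1     # residual SD (arbitrary)

beta <- rnorm(n, sd = sigma_p)          # shared by both observations
y1   <- 0 + beta + rnorm(n, sd = sigma) # first condition
y2   <- 1 + beta + rnorm(n, sd = sigma) # second condition, shifted mean

empirical   <- cor(y1, y2)
theoretical <- sigma_p^2 / (sigma_p^2 + sigma^2)
c(empirical = empirical, theoretical = theoretical)
```

The fixed shift between conditions does not affect the correlation, which is why only the two variance components appear in the formula.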

Note, however, that the models are not quite equivalent: the random effect model forces the correlation to be positive, whereas the CS model and the t-test/anova model do not.

EDIT: There are two other differences as well. First, the CS and random effect models assume normality for the random effect, but the t-test/anova model does not. Secondly, the CS and random effect models are fit using likelihood-based methods (maximum likelihood, or REML, which is the default in lme and gls), while the anova is fit using mean squares; when everything is balanced they will agree, but not necessarily in more complex situations. Finally, I'd be wary of using F/df/p values from the various fits as measures of how much the models agree; see Doug Bates's famous screed on df's for more details. (END EDIT)

The problem with your R code is that you're not specifying the correlation structure properly. You need to use gls with the corCompSymm correlation structure.

Generate data so that there is a subject effect:

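set.seed(5)
x <- rnorm(10)           # subject-level effect shared by both measurements
x1 <- x + rnorm(10)      # first measurement per subject
x2 <- x + 1 + rnorm(10)  # second measurement, mean shifted by 1
myDat <- data.frame(c(x1,x2), c(rep("x1", 10), rep("x2", 10)), 
                    rep(paste("S", seq(1,10), sep=""), 2))
names(myDat) <- c("y", "x", "subj")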

Then here's how you'd fit the random effects and the compound symmetry models.

library(nlme)
fm1 <- lme(y ~ x, random=~1|subj, data=myDat)
fm2 <- gls(y ~ x, correlation=corCompSymm(form=~1|subj), data=myDat)

The random-effect and residual variances from the random effects model (the reported standard deviations, squared) are:

m1.varp <- 0.5453527^2
m1.vare <- 1.084408^2

And the correlation and residual variance from the CS model are:

m2.rho <- 0.2018595
m2.var <- 1.213816^2

And they're equal to what is expected:

> m1.varp/(m1.varp+m1.vare)
[1] 0.2018594
> sqrt(m1.varp + m1.vare)
[1] 1.213816
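
Rather than copying the numbers from the printed summaries, the same comparison can be made by extracting the estimates from the fitted objects. This is a sketch that regenerates the data and both fits from above so it runs on its own; the variable names sd_subj, sd_resid, icc and rho are just illustrative.

```r
library(nlme)

# Same data and models as above
set.seed(5)
x  <- rnorm(10)
x1 <- x + rnorm(10)
x2 <- x + 1 + rnorm(10)
myDat <- data.frame(y    = c(x1, x2),
                    x    = factor(rep(c("x1", "x2"), each = 10)),
                    subj = factor(rep(paste0("S", 1:10), 2)))
fm1 <- lme(y ~ x, random = ~1 | subj, data = myDat)
fm2 <- gls(y ~ x, correlation = corCompSymm(form = ~1 | subj), data = myDat)

# Random-intercept SD and residual SD from the lme fit
sd_subj  <- as.numeric(VarCorr(fm1)["(Intercept)", "StdDev"])
sd_resid <- fm1$sigma

# Within-subject correlation and residual SD from the gls fit
rho <- unname(coef(fm2$modelStruct$corStruct, unconstrained = FALSE))

icc <- sd_subj^2 / (sd_subj^2 + sd_resid^2)
c(icc = icc, rho = rho,
  total_sd_lme = sqrt(sd_subj^2 + sd_resid^2), sigma_gls = fm2$sigma)
```

VarCorr returns its entries as formatted text, so the values recovered this way are rounded to the printed precision.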

Other correlation structures are usually not fit with random effects but simply by specifying the desired structure; one common exception is the AR(1) + random effect model, which has a random effect and AR(1) correlation between observations on the same random effect.

EDIT2: When I fit the three options, I get exactly the same results except that gls doesn't try to guess the df for the term of interest.

> summary(fm1)
...
Fixed effects: y ~ x 
                 Value Std.Error DF   t-value p-value
(Intercept) -0.5611156 0.3838423  9 -1.461839  0.1778
xx2          2.0772757 0.4849618  9  4.283380  0.0020

> summary(fm2)
...
                 Value Std.Error   t-value p-value
(Intercept) -0.5611156 0.3838423 -1.461839  0.1610
xx2          2.0772757 0.4849618  4.283380  0.0004

> m1 <- lm(y~ x + subj, data=myDat)
> summary(m1)
...
            Estimate Std. Error t value Pr(>|t|)   
(Intercept)  -0.3154     0.8042  -0.392  0.70403   
xx2           2.0773     0.4850   4.283  0.00204 **

(The intercept is different here because with the default coding, it's not the mean of all subjects but instead the mean of the first subject.)

It's also of interest to note that the newer lme4 package gives the same results but doesn't even try to compute a p-value.

> library(lme4)
> mm1 <- lmer(y ~ x + (1|subj), data=myDat)
> summary(mm1)
...
            Estimate Std. Error t value
(Intercept)  -0.5611     0.3838  -1.462
xx2           2.0773     0.4850   4.283

Solved – Post Hoc test for between subject factor in a repeated measures ANOVA in R

In R, for any ANOVA other than a simple one-factor model, pass TukeyHSD the name of the term(s) for which you want intervals; otherwise it computes them for every term in the model.

summary(cc<-aov(weight~as.factor(Diet)+as.factor(Chick),data=ChickWeight))
TukeyHSD(cc,"as.factor(Diet)",data=ChickWeight)

Use ?TukeyHSD to get the detailed help.
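
As a smaller illustration of the which argument, here is a sketch using the built-in warpbreaks data (the dataset used in the TukeyHSD help page) instead of ChickWeight:

```r
# Two-factor additive ANOVA on a built-in dataset
fit <- aov(breaks ~ wool + tension, data = warpbreaks)

# Request Tukey intervals only for the tension factor
tk <- TukeyHSD(fit, which = "tension", conf.level = 0.95)
tk
```

The result is a list with one matrix per requested term; each row is a pairwise comparison with columns diff, lwr, upr and p adj.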