I have some trouble obtaining equivalent results between an aov between-within repeated measures model and an lmer mixed model. My data and script look as follows:
data=read.csv("https://www.dropbox.com/s/zgle45tpyv5t781/fitness.csv?dl=1")
data$id=factor(data$id)  # treat the subject identifier as a factor, not a number
data
id FITNESS TEST PULSE
1 1 pilates CYCLING 91
2 2 pilates CYCLING 82
3 3 pilates CYCLING 65
4 4 pilates CYCLING 90
5 5 pilates CYCLING 79
6 6 pilates CYCLING 84
7 7 aerobics CYCLING 84
8 8 aerobics CYCLING 77
9 9 aerobics CYCLING 71
10 10 aerobics CYCLING 91
11 11 aerobics CYCLING 72
12 12 aerobics CYCLING 93
13 13 zumba CYCLING 63
14 14 zumba CYCLING 87
15 15 zumba CYCLING 67
16 16 zumba CYCLING 98
17 17 zumba CYCLING 63
18 18 zumba CYCLING 72
19 1 pilates JOGGING 136
20 2 pilates JOGGING 119
21 3 pilates JOGGING 126
22 4 pilates JOGGING 108
23 5 pilates JOGGING 122
24 6 pilates JOGGING 101
25 7 aerobics JOGGING 116
26 8 aerobics JOGGING 142
27 9 aerobics JOGGING 137
28 10 aerobics JOGGING 134
29 11 aerobics JOGGING 131
30 12 aerobics JOGGING 120
31 13 zumba JOGGING 99
32 14 zumba JOGGING 99
33 15 zumba JOGGING 98
34 16 zumba JOGGING 99
35 17 zumba JOGGING 87
36 18 zumba JOGGING 89
37 1 pilates SPRINTING 179
38 2 pilates SPRINTING 195
39 3 pilates SPRINTING 188
40 4 pilates SPRINTING 189
41 5 pilates SPRINTING 173
42 6 pilates SPRINTING 193
43 7 aerobics SPRINTING 184
44 8 aerobics SPRINTING 179
45 9 aerobics SPRINTING 179
46 10 aerobics SPRINTING 174
47 11 aerobics SPRINTING 164
48 12 aerobics SPRINTING 182
49 13 zumba SPRINTING 111
50 14 zumba SPRINTING 103
51 15 zumba SPRINTING 113
52 16 zumba SPRINTING 118
53 17 zumba SPRINTING 127
54 18 zumba SPRINTING 113
Basically, 3 x 6 subjects (id) each followed one of three FITNESS workout schemes (a between-subjects factor), and their PULSE was measured after carrying out each of three types of endurance TEST (a within-subjects factor).
I then fitted the following aov model:
library(afex)
library(car)
set_sum_contrasts()
fit1 = aov(PULSE ~ FITNESS*TEST + Error(id/TEST),data=data)
summary(fit1)
Error: id
Df Sum Sq Mean Sq F value Pr(>F)
FITNESS 2 14194 7097 115.1 7.92e-10 ***
Residuals 15 925 62
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Error: id:TEST
Df Sum Sq Mean Sq F value Pr(>F)
TEST 2 57459 28729 253.7 < 2e-16 ***
FITNESS:TEST 4 8200 2050 18.1 1.16e-07 ***
Residuals 30 3397 113
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
The result I obtain using afex's aov.car,

set_sum_contrasts()
fit2=aov.car(PULSE ~ FITNESS*TEST+Error(id/TEST),data=data,type=3,return="Anova")
summary(fit2)

is identical to this.
A mixed model run using nlme gives a directly equivalent result, e.g. using lme:

library(nlme)
lme1=lme(PULSE ~ FITNESS*TEST, random=~1|id, correlation=corCompSymm(form=~1|id),data=data)
anova(lme1)
numDF denDF F-value p-value
(Intercept) 1 30 12136.126 <.0001
FITNESS 2 15 115.127 <.0001
TEST 2 30 253.694 <.0001
FITNESS:TEST 4 30 18.103 <.0001
summary(lme1)
Linear mixed-effects model fit by REML
Data: data
AIC BIC logLik
371.5375 393.2175 -173.7688
Random effects:
Formula: ~1 | id
(Intercept) Residual
StdDev: 1.699959 9.651662
Correlation Structure: Compound symmetry
Formula: ~1 | id
Parameter estimate(s):
Rho
-0.2156615
Fixed effects: PULSE ~ FITNESS * TEST
Value Std.Error DF t-value p-value
(Intercept) 81.33333 4.000926 30 20.328628 0.0000
FITNESSpilates 0.50000 5.658164 15 0.088368 0.9308
FITNESSzumba -6.33333 5.658164 15 -1.119327 0.2806
TESTJOGGING 48.66667 6.143952 30 7.921069 0.0000
TESTSPRINTING 95.66667 6.143952 30 15.570868 0.0000
FITNESSpilates:TESTJOGGING -11.83333 8.688861 30 -1.361897 0.1834
FITNESSzumba:TESTJOGGING -28.50000 8.688861 30 -3.280062 0.0026
FITNESSpilates:TESTSPRINTING 8.66667 8.688861 30 0.997446 0.3265
FITNESSzumba:TESTSPRINTING -56.50000 8.688861 30 -6.502579 0.0000
Or using gls:

library(nlme)
gls1=gls(PULSE ~ FITNESS*TEST, correlation=corCompSymm(form=~1|id),data=data)
anova(gls1)
However, the result I obtain using lme4's lmer is different:
library(lme4)
set_sum_contrasts()
fit3=lmer(PULSE ~ FITNESS*TEST+(1|id),data=data)
summary(fit3)
Linear mixed model fit by REML ['lmerMod']
Formula: PULSE ~ FITNESS * TEST + (1 | id)
Data: data
REML criterion at convergence: 362.4
Random effects:
Groups Name Variance Std.Dev.
id (Intercept) 0.00 0.0
Residual 96.04 9.8
...
Anova(fit3,test.statistic="F",type=3)
Analysis of Deviance Table (Type III Wald F tests with Kenward-Roger df)
Response: PULSE
F Df Df.res Pr(>F)
(Intercept) 7789.360 1 15 < 2.2e-16 ***
FITNESS 73.892 2 15 1.712e-08 ***
TEST 299.127 2 30 < 2.2e-16 ***
FITNESS:TEST 21.345 4 30 2.030e-08 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Anybody any thoughts on what I am doing wrong with the lmer model? Or where the difference comes from? Could it have anything to do with lmer not allowing negative intraclass correlations, or something like that? Given that nlme's gls and lme do return the correct result, I am wondering how this is handled differently in gls and lme. Is it that the option correlation=corCompSymm(form=~1|id) causes them to directly estimate the intraclass correlation, which can be either positive or negative, whereas lmer estimates a variance component, which cannot be negative (and ends up being estimated as zero in this case)?
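For what it's worth, the estimated correlation can be pulled straight out of the lme fit; a quick check (assuming the usual nlme accessors behave as I expect):

# extract the estimated intraclass correlation (Rho) from the corCompSymm structure
coef(lme1$modelStruct$corStruct, unconstrained=FALSE)
#        Rho
# -0.2156615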
Best Answer
As Ben Bolker already mentioned in the comments, the problem is as you suspect: the lmer() model gets tripped up because it attempts to estimate a variance component model, with the variance component estimates constrained to be non-negative. What I will try to do is give a somewhat intuitive understanding of what it is about your dataset that leads to this, and why this causes a problem for variance component models.

Here is a plot of your dataset. The white dots are the actual observations and the black dots are the subject means.
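A plot along these lines can be redrawn from the data in the question; a minimal sketch (ggplot2 assumed; cosmetic details of the original figure may differ):

library(ggplot2)
subj_plot_means = aggregate(PULSE ~ id, data=data, FUN=mean)
ggplot(data, aes(x=id, y=PULSE)) +
  geom_point(shape=21, fill="white", size=2) +  # observations (white dots)
  geom_point(data=subj_plot_means, size=3) +    # subject means (black dots)
  theme_bw()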
To simplify things, but without changing the spirit of the problem, I will subtract out the fixed effects (i.e., the FITNESS and TEST effects, as well as the grand mean) and deal with the residual data as a one-way random effects problem.
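In code, this residualization step might look like the following sketch (I take residuals from the full fixed-effects model, interaction included, which is an assumption about the exact computation; resid_data is my name for the result). The plotting sketch above can then be reused with resid_data in place of data.

# remove the grand mean, FITNESS, TEST, and their interaction,
# leaving only subject and error variation
resid_data = data
resid_data$PULSE = resid(lm(PULSE ~ FITNESS*TEST, data=data))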
So here's what the new dataset looks like:

Look hard at the patterns in this plot. Think about how observations taken from the same subject differ from observations taken from different subjects. Specifically, notice the following pattern: when one of the observations for a subject lies above (or below) the subject mean, the other observations from that subject tend to lie on the opposite side of the subject mean. And the further that observation is from the subject mean, the further the other observations tend to be from the subject mean on the opposite side. This indicates a negative intra-class correlation: two observations taken from the same subject actually tend to be less similar, on average, than two observations drawn purely at random from the dataset.
Another way to think about this pattern is in terms of the relative magnitudes of the between-subject and within-subject variance. It appears that there is substantially greater within-subject variance compared to between-subject variance. Of course, we expect this to happen to some extent. After all, the within-subject variance is based on variation in the individual data points, while the between-subject variance is based on variation in means of the individual data points (i.e., the subject means), and we know that the variance of a mean will tend to decrease as the number of things being averaged increases. But in this dataset the difference is quite striking: There is way more within-subject than between-subject variation. Actually this difference is exactly the reason why a negative intra-class correlation emerges.
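To put rough numbers on that comparison, both variances can be computed directly from the residualized data (a sketch, reusing resid_data from above):

# between-subject variance: variance of the subject means
subj_means_r = tapply(resid_data$PULSE, resid_data$id, mean)
var(subj_means_r)
# within-subject variance (rough: treats all 54 deviations from the
# subject means as one sample)
var(resid_data$PULSE - subj_means_r[as.character(resid_data$id)])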
Okay, so here is the problem. The variance component model assumes that each data point is the sum of a subject effect and an error: $y_{ij}=u_j+e_{ij}$, where $u_j$ is the effect of the $j$th subject and $e_{ij}$ is the error attached to the $i$th observation from that subject. So let's think about what would happen if there were truly 0 variance in the subject effects -- in other words, if the true between-subjects variance component were 0. Given an actual dataset generated under this model, if we were to compute sample means for each subject's observed data, those sample means would still have some non-zero variance, but they would reflect only error variance, and not any "true" subject variance (because we have assumed there is none).
So how variable would we expect these subject means to be? Well, basically each estimated subject effect is a mean, and we know the formula for the variance of a mean: $\text{var}(\bar{X})=\text{var}(X_i)/n$, where $n$ is the number of things being averaged. Now let's apply this formula to your dataset and see how much variance we would expect to see in the estimated subject effects if the true between-subjects variance component were exactly 0.
The within-subject variance works out to be $348$, and each subject effect is computed as the mean of 3 observations. So the expected standard deviation in the subject means -- assuming the true between-subject variance is 0 -- works out to be about $10.8$. Now compare this to the standard deviation in subject means that we actually observed: $4.3$! The observed variation is substantially less than the expected variation when we assumed 0 between-subject variance. For a variance component model, the only way that the observed variation could be expected to be as low as what we actually observed is if the true between-subject variance were somehow negative. And therein lies the problem. The data imply that there is somehow a negative variance component, but the software (sensibly) will not allow negative estimates of variance components, since a variance can in fact never be negative. The other models that you fit avoid this problem by directly estimating the intra-class correlation rather than assuming a simple variance component model.
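As a worked check of that arithmetic (using the numbers quoted above and subj_means_r from the earlier sketch):

within_var = 348     # within-subject variance quoted above
sqrt(within_var / 3) # expected SD of subject means if between-variance were 0: about 10.8
sd(subj_means_r)     # observed SD of the subject means: about 4.3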
If you want to see how you could actually get the negative variance component estimate implied by your dataset, you can use the procedure that I illustrate (with accompanying R code) in this other recent answer of mine. That procedure is not totally trivial, but not too hard either (for a balanced design such as this one).
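Alternatively, for a balanced design like this one, the classic expected-mean-squares (method-of-moments) estimator exposes the negative value directly. A sketch using the mean squares from the aov output in the question (a standard textbook computation, not necessarily identical to the procedure linked above):

# between- and within-subject mean squares from the aov table
ms_between = 925 / 15   # Error: id stratum, Residuals
ms_within  = 3397 / 30  # Error: id:TEST stratum, Residuals
# implied between-subject variance component, with 3 observations per subject
(ms_between - ms_within) / 3  # comes out negative (about -17.2)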