I recently came across what I think may be a problem in how the anova() function from the lmerTest package computes its F-statistics and corresponding P-values for fixed effects in mixed-effects models. Let me start by saying that I know of the controversy surrounding the calculation of P-values from mixed-effects models (for reasons discussed here). Nonetheless, many folks still want P-values, and a number of methods have been developed to accommodate this (see here). Here I want to show the results of a commonly used approach, namely the anova() function from the lmerTest package, in the hope that someone has an idea of why the results are not quite making sense.
First, here are my data. I had to link to the file because of its size. Note that the Biomass column has been standardized (mean = 0, sd = 1), hence the negative values; this does not alter the output. Once the file has been downloaded and the working directory specified, it can be read in as follows:
dat <- read.csv("StackOverflow_Data.csv", header = TRUE)
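(For reference, the standardization could be reproduced along these lines. This is only a sketch, assuming a hypothetical raw column Biomass_raw; the posted file is already standardized.)

dat$Biomass <- as.numeric(scale(dat$Biomass_raw))  # hypothetical raw column; centers to mean 0, scales to sd 1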
Below is my model, fit with the lmer function from lme4. It has plant biomass as the response variable and three factors (A, B, and C), each with two levels, as predictors. Plant genotype and spatial block are included as random effects.
library(lme4)
library(lmerTest)  # masks lme4::lmer so that P-values are computed

model <- lmer(Biomass ~ A + B + C +
                A:B + A:C +
                B:C + A:B:C +
                (1 | Genotype) + (1 | Block),
              data = dat, REML = TRUE)
Summarizing the above model with summary(model), we get:
Linear mixed model fit by maximum likelihood t-tests use Satterthwaite approximations
to degrees of freedom [lmerMod]
Formula: Biomass ~ A + B + C + A:B + A:C + B:C + A:B:C + (1 | Genotype) +
(1 | Block)
Data: dat
AIC BIC logLik deviance df.resid
1059.7 1111.0 -518.8 1037.7 776
Scaled residuals:
Min 1Q Median 3Q Max
-3.04330 -0.63914 0.00315 0.69108 2.82368
Random effects:
Groups Name Variance Std.Dev.
Genotype (Intercept) 0.07509 0.2740
Block (Intercept) 0.01037 0.1018
Residual 0.19038 0.4363
Number of obs: 787, groups: Genotype, 50; Block, 6
Fixed effects:
Estimate Std. Error df t value Pr(>|t|)
(Intercept) 2.27699 0.08162 47.50000 27.897 < 2e-16 ***
AYes -0.02308 0.09958 99.30000 -0.232 0.81719
BReduced -0.11036 0.06232 733.00000 -1.771 0.07700 .
CSupp -0.02152 0.06243 733.70000 -0.345 0.73039
AYes:BReduced 0.25113 0.08838 733.70000 2.841 0.00462 **
AYes:CSupp 0.02179 0.08854 734.50000 0.246 0.80567
BReduced:CSupp 0.19436 0.08838 733.10000 2.199 0.02817 *
AYes:BReduced:CSupp -0.21746 0.12507 734.20000 -1.739 0.08251 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Correlation of Fixed Effects:
(Intr) AYes BRedcd CSupp AYs:BR AYs:CS BRd:CS
AYes -0.607
BReduced -0.379 0.311
CSupp -0.379 0.311 0.498
AYes:BRedcd 0.269 -0.444 -0.706 -0.354
AYes:CSupp 0.268 -0.444 -0.352 -0.708 0.503
BRedcd:CSpp 0.267 -0.219 -0.706 -0.705 0.498 0.500
AYs:BRdc:CS -0.190 0.315 0.500 0.502 -0.709 -0.709 -0.708
The summary above uses the lmerTest package to compute P-values from the t-statistics using Satterthwaite's approximation to the denominator degrees of freedom. From this we see that both the A:B and B:C interactions are significant at the p = 0.05 level. In theory, these results should be consistent, at least qualitatively, with those produced by the anova() function in the lmerTest package, which computes P-values in the same way. However, this isn't the case. Here is the output from anova(model, type = 3); note the type = 3 argument requesting type 3 SS:
Analysis of Variance Table of type III with Satterthwaite
approximation for degrees of freedom
Sum Sq Mean Sq NumDF DenDF F.value Pr(>F)
A 0.09492 0.09492 1 49.87 0.4986 0.48342
B 0.66040 0.66040 1 732.66 3.4688 0.06294 .
C 0.20207 0.20207 1 733.90 1.0614 0.30324
A:B 0.99470 0.99470 1 732.56 5.2247 0.02255 *
A:C 0.36903 0.36903 1 733.66 1.9383 0.16427
B:C 0.35867 0.35867 1 733.20 1.8839 0.17031
A:B:C 0.57552 0.57552 1 734.23 3.0230 0.08251 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
These results clearly differ. The B:C interaction is no longer significant, and the P-value for the A:B interaction is quite a bit higher. Both functions should be computing the P-values in similar ways, so it's hard to imagine them being so different.
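One way to make the expected agreement concrete: since each factor has only two levels, every term in the ANOVA table has 1 numerator df, so each F-value should simply be the square of the corresponding t-statistic from summary(). A minimal sketch of that check, assuming the old lmerTest column names shown in the output above (t value and F.value):

# Compare squared t-statistics from summary() with F-values from anova()
t_vals <- coef(summary(model))[-1, "t value"]   # drop the intercept row
f_vals <- anova(model, type = 3)[, "F.value"]
round(cbind(t_squared = t_vals^2, F_anova = f_vals), 3)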
Why are they different?
This was part of the original question, but it can be misleading; see the answer below. In fact, it seems that anova(model, type = 3) is actually using type 2 SS, which we can verify by running anova(model, type = 2):
Analysis of Variance Table of type II with Satterthwaite
approximation for degrees of freedom
Sum Sq Mean Sq NumDF DenDF F.value Pr(>F)
A 0.09526 0.09526 1 49.87 0.5004 0.48263
B 0.65996 0.65996 1 732.66 3.4665 0.06302 .
C 0.19639 0.19639 1 733.91 1.0315 0.31013
A:B 0.99282 0.99282 1 732.56 5.2148 0.02268 *
A:C 0.37018 0.37018 1 733.65 1.9444 0.16362
B:C 0.35523 0.35523 1 733.20 1.8659 0.17237
A:B:C 0.57552 0.57552 1 734.23 3.0230 0.08251 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
The results are very similar, which should not be the case given the presence of interactions in the model. To show that lmerTest::anova() is in fact using type 2 SS rather than the type 3 SS it displays in its output, we can use the Anova() function from the car package. Anova(model, type = 2, test.statistic = 'F') produces:
Analysis of Deviance Table (Type II Wald F tests with Kenward-Roger df)
Response: Biomass
F Df Df.res Pr(>F)
A 0.4857 1 48.28 0.48917
B 3.4537 1 726.63 0.06351 .
C 1.0337 1 727.77 0.30962
A:B 5.1456 1 726.54 0.02360 *
A:C 1.9302 1 727.55 0.16517
B:C 1.8776 1 727.12 0.17103
A:B:C 2.9915 1 728.06 0.08413 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Note that the use of Kenward-Roger ddf does not change the results much for my data. What is clear is that the type 2 SS results from the car package are analogous to the type 3 SS results from the lmerTest package. This suggests that the lmerTest package is in fact computing type 2 SS. I struggle to see why that would be the case unless there is a problem in how the lmerTest package computes its P-values. Am I missing something?
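One further check I could run, sketched below: ask car for a genuine type III table and see whether it matches lmerTest's "type 3" output. (Note that type III tests from car are only meaningful under sum-to-zero contrasts; see the answer below.)

library(car)
Anova(model, type = 3, test.statistic = "F")  # genuine type III, Kenward-Roger ddf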
Any suggestions or ideas are welcome. Thanks a bunch!
Edit: December 6, 2016, 11:40 am
A few folks have indicated that this question is a duplicate of the one here. However, I don't see how it is. That post aims to understand why aov() and lme() produce different F-statistics, which, it turns out, relates to how the variance components are calculated by the different functions. Here I am running only a single model using lmer and trying to understand why lmerTest::anova(model) and summary(model) produce different P-values, despite the fact that they should be computed in similar ways. lmerTest::anova() seems to be using type 2 SS rather than the reported type 3 SS, which should only matter in the presence of interactions, and none of the models in the other post contain interactions.
Best Answer
In what is undoubtedly a newbie mistake, I have solved the problem I detailed above. The key to getting lmerTest::anova() and summary() to agree is to change R's global contrasts to sum-to-zero contrasts by entering:

options(contrasts = c("contr.sum", "contr.poly"))

If these contrasts are not set, then summary() will produce the default type I sums of squares (note that car::Anova() will also use this default). On the other hand, lmerTest::anova() will produce type III SS regardless of the default contrasts, suggesting that it handles the contrasts internally. If the contrasts are set as noted above, then all three functions will produce output using the desired type III SS.
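To make the fix concrete, here is a minimal sketch of the full workflow, assuming the data and model from the question (A * B * C is just shorthand for the fully expanded formula above):

# Set sum-to-zero contrasts globally, then refit and re-test
options(contrasts = c("contr.sum", "contr.poly"))

model2 <- lmer(Biomass ~ A * B * C + (1 | Genotype) + (1 | Block),
               data = dat, REML = TRUE)

anova(model2, type = 3)                             # lmerTest type III table
summary(model2)                                     # t-tests now agree (F = t^2)
car::Anova(model2, type = 3, test.statistic = "F")  # type III with Kenward-Roger ddf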