I recently came across what I think may be a problem in how the anova() function from the lmerTest package computes its F-statistics and corresponding P-values for fixed effects in mixed-effects models. Let me start by saying that I know of the controversy surrounding the calculation of P-values from mixed-effects models (for reasons discussed here). Nonetheless, many folks still want P-values, and a number of methods have been developed to accommodate this (see here). Here I want to show the results of a commonly used approach, namely the anova() function from the lmerTest package, in the hope that someone has an idea of why the results are not quite making sense.
First, here are my data. I had to link to the file because of its size. Note that the Biomass column has been standardized (mean = 0, sd = 1), hence the negative values; this does not alter the output. Once the file has been downloaded and the working directory specified, it can be read in as follows:
dat <- read.csv("StackOverflow_Data.csv", header = TRUE)
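(For reference, the standardization could be reproduced along these lines. This is only a sketch, assuming a hypothetical raw column Biomass_raw; the posted file is already standardized.)

dat$Biomass <- as.numeric(scale(dat$Biomass_raw))  # hypothetical raw column; centers to mean 0, scales to sd 1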
Below is my model, fit with the lmer function from lme4. It has plant biomass as the response variable and three factors (A, B, and C), each with two levels, as predictors. Plant genotype and spatial block are included as random effects.
library(lme4)
library(lmerTest)  # masks lme4::lmer so that P-values are computed

model <- lmer(Biomass ~ A + B + C +
                A:B + A:C +
                B:C + A:B:C +
                (1 | Genotype) + (1 | Block),
              data = dat, REML = TRUE)
Summarizing the above model with summary(model), we get:
Linear mixed model fit by maximum likelihood t-tests use Satterthwaite approximations
to degrees of freedom [lmerMod]
Formula: Biomass ~ A + B + C + A:B + A:C + B:C + A:B:C + (1 | Genotype) +
(1 | Block)
Data: dat
AIC BIC logLik deviance df.resid
1059.7 1111.0 -518.8 1037.7 776
Scaled residuals:
Min 1Q Median 3Q Max
-3.04330 -0.63914 0.00315 0.69108 2.82368
Random effects:
Groups Name Variance Std.Dev.
Genotype (Intercept) 0.07509 0.2740
Block (Intercept) 0.01037 0.1018
Residual 0.19038 0.4363
Number of obs: 787, groups: Genotype, 50; Block, 6
Fixed effects:
Estimate Std. Error df t value Pr(>|t|)
(Intercept) 2.27699 0.08162 47.50000 27.897 < 2e-16 ***
AYes -0.02308 0.09958 99.30000 -0.232 0.81719
BReduced -0.11036 0.06232 733.00000 -1.771 0.07700 .
CSupp -0.02152 0.06243 733.70000 -0.345 0.73039
AYes:BReduced 0.25113 0.08838 733.70000 2.841 0.00462 **
AYes:CSupp 0.02179 0.08854 734.50000 0.246 0.80567
BReduced:CSupp 0.19436 0.08838 733.10000 2.199 0.02817 *
AYes:BReduced:CSupp -0.21746 0.12507 734.20000 -1.739 0.08251 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Correlation of Fixed Effects:
(Intr) AYes BRedcd CSupp AYs:BR AYs:CS BRd:CS
AYes -0.607
BReduced -0.379 0.311
CSupp -0.379 0.311 0.498
AYes:BRedcd 0.269 -0.444 -0.706 -0.354
AYes:CSupp 0.268 -0.444 -0.352 -0.708 0.503
BRedcd:CSpp 0.267 -0.219 -0.706 -0.705 0.498 0.500
AYs:BRdc:CS -0.190 0.315 0.500 0.502 -0.709 -0.709 -0.708
The summary above uses the lmerTest package to compute P-values from the t-statistics using Satterthwaite's approximation to the denominator degrees of freedom. From this we see that both the A:B and B:C interactions are significant at the p = 0.05 level. In theory, these results should be consistent, at least qualitatively, with those produced by the anova() function in the lmerTest package, which computes P-values in the same way. However, this isn't the case. Here is the output from anova(model, type = 3); note the type = 3 argument requesting type 3 SS:
Analysis of Variance Table of type III with Satterthwaite
approximation for degrees of freedom
Sum Sq Mean Sq NumDF DenDF F.value Pr(>F)
A 0.09492 0.09492 1 49.87 0.4986 0.48342
B 0.66040 0.66040 1 732.66 3.4688 0.06294 .
C 0.20207 0.20207 1 733.90 1.0614 0.30324
A:B 0.99470 0.99470 1 732.56 5.2247 0.02255 *
A:C 0.36903 0.36903 1 733.66 1.9383 0.16427
B:C 0.35867 0.35867 1 733.20 1.8839 0.17031
A:B:C 0.57552 0.57552 1 734.23 3.0230 0.08251 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
These results clearly differ. The B:C interaction is no longer significant, and the P-value for the A:B interaction is quite a bit higher. Both functions should be computing the P-values in similar ways, so it's hard to imagine them being so different.
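One way to make the expected agreement concrete: since each factor has only two levels, every term in the ANOVA table has 1 numerator df, so each F-value should simply be the square of the corresponding t-statistic from summary(). A minimal sketch of that check, assuming the old lmerTest column names shown in the output above (t value and F.value):

# Compare squared t-statistics from summary() with F-values from anova()
t_vals <- coef(summary(model))[-1, "t value"]   # drop the intercept row
f_vals <- anova(model, type = 3)[, "F.value"]
round(cbind(t_squared = t_vals^2, F_anova = f_vals), 3)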
Why are they different?
This was part of the original question, but it can be misleading; see the answer below. In fact, it seems that anova(model, type = 3) is actually using type 2 SS, which we can verify by running anova(model, type = 2):
Analysis of Variance Table of type II with Satterthwaite
approximation for degrees of freedom
Sum Sq Mean Sq NumDF DenDF F.value Pr(>F)
A 0.09526 0.09526 1 49.87 0.5004 0.48263
B 0.65996 0.65996 1 732.66 3.4665 0.06302 .
C 0.19639 0.19639 1 733.91 1.0315 0.31013
A:B 0.99282 0.99282 1 732.56 5.2148 0.02268 *
A:C 0.37018 0.37018 1 733.65 1.9444 0.16362
B:C 0.35523 0.35523 1 733.20 1.8659 0.17237
A:B:C 0.57552 0.57552 1 734.23 3.0230 0.08251 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
The results are very similar, which should not be the case given the presence of interactions in the model. To show that lmerTest::anova() is in fact using type 2 SS rather than the type 3 SS it displays in its output, we can use the Anova() function from the car package. Anova(model, type = 2, test.statistic = 'F') produces:
Analysis of Deviance Table (Type II Wald F tests with Kenward-Roger df)
Response: Biomass
F Df Df.res Pr(>F)
A 0.4857 1 48.28 0.48917
B 3.4537 1 726.63 0.06351 .
C 1.0337 1 727.77 0.30962
A:B 5.1456 1 726.54 0.02360 *
A:C 1.9302 1 727.55 0.16517
B:C 1.8776 1 727.12 0.17103
A:B:C 2.9915 1 728.06 0.08413 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Note that the use of Kenward-Roger ddf does not change the results much for my data. What is clear is that the type 2 SS results from the car package are analogous to the type 3 SS results from the lmerTest package. This suggests that the lmerTest package is in fact computing type 2 SS. I struggle to see why that would be the case unless there is a problem in how the lmerTest package computes its P-values. Am I missing something?
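One further check I could run, sketched below: ask car for a genuine type III table and see whether it matches lmerTest's "type 3" output. (Note that type III tests from car are only meaningful under sum-to-zero contrasts; see the answer below.)

library(car)
Anova(model, type = 3, test.statistic = "F")  # genuine type III, Kenward-Roger ddf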
Any suggestions or ideas are welcome. Thanks a bunch!
Edit: December 6, 2016, 11:40 am
A few folks have indicated that this question is a duplicate of the one here. However, I don't see how it is. That post aims to understand why aov() and lme() produce different F-statistics, which, it turns out, relates to how the variance components are calculated by the different functions. Here I am running only a single model using lmer and trying to understand why lmerTest::anova(model) and summary(model) produce different P-values, despite the fact that they should be computed in similar ways. lmerTest::anova() seems to be using type 2 SS rather than the reported type 3 SS, which should only matter in the presence of interactions, and none of the models in the other post contain interactions.
Best Answer
In what is undoubtedly a newbie mistake, I have solved the problem I detailed above. The key to getting lmerTest::anova() and summary() to agree is to change R's global contrasts to sum-to-zero contrasts by entering:

options(contrasts = c("contr.sum", "contr.poly"))

If these contrasts are not set, then summary() will produce the default type I sums of squares (note that car::Anova() will also use this default). On the other hand, lmerTest::anova() will produce type III SS regardless of the default contrasts, suggesting that it handles the contrasts internally. If the contrasts are set as noted above, then all three functions will produce output using the desired type III SS.
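To make the fix concrete, here is a minimal sketch of the full workflow, assuming the data and model from the question (A * B * C is just shorthand for the fully expanded formula above):

# Set sum-to-zero contrasts globally, then refit and re-test
options(contrasts = c("contr.sum", "contr.poly"))

model2 <- lmer(Biomass ~ A * B * C + (1 | Genotype) + (1 | Block),
               data = dat, REML = TRUE)

anova(model2, type = 3)                             # lmerTest type III table
summary(model2)                                     # t-tests now agree (F = t^2)
car::Anova(model2, type = 3, test.statistic = "F")  # type III with Kenward-Roger ddf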