Post-Hoc Tests in LMER Models – Huge Degrees of Freedom and Emmeans Analysis

Tags: degrees-of-freedom, lme4-nlme, lsmeans, post-hoc, t-test

I have a trial-wise linear mixed model with a categorical between-subjects factor group (A vs. B), two categorical within-subjects factors itemEmotion (neutral vs. negative) and neuralArea (frontal vs. posterior), their interactions, and random intercepts for each subject (n = 52, 26 per group) and item (60 items in total: 30 neutral, 30 negative):

library(lmerTest)  # loads lme4 and provides Satterthwaite df in summary()

LMM <- lmer(neuralActivity ~ group * itemEmotion * neuralArea +
              (1 | subject) + (1 | item), data = data)
summary(LMM)

Based on a significant group x neuralArea interaction, I ran post-hoc tests on the difference between frontal and posterior neuralArea within each group using emmeans():

library(emmeans)

LMM.emmeans <- emmeans(LMM, pairwise ~ neuralArea | group,
                       lmer.df = "satterthwaite", lmerTest.limit = 6240)
summary(LMM.emmeans)

which gave these results:

$emmeans
group = A:
 neuralArea  emmean      SE  df   lower.CL upper.CL
 anterior    0.00295 0.00133 119  0.0003165  0.00559
 posterior   0.00536 0.00133 119  0.0027233  0.00800

group = B:
 neuralArea  emmean      SE  df   lower.CL upper.CL
 anterior    0.00257 0.00133 119 -0.0000661  0.00521
 posterior   0.01082 0.00133 119  0.0081836  0.01346

Results are averaged over the levels of: itemEmotion 
Degrees-of-freedom method: satterthwaite 
Confidence level used: 0.95 

$contrasts
group = A:
 contrast              estimate      SE   df t.ratio p.value
 anterior - posterior  -0.00241 0.00157 6124  -1.531  0.1258

group = B:
 contrast               estimate      SE   df t.ratio p.value
 anterior - posterior   -0.00825 0.00157 6124  -5.248  <.0001

Results are averaged over the levels of: itemEmotion 
Degrees-of-freedom method: satterthwaite 

What really confuses me are the 6124 df for the t-tests, as I've never seen such high df reported. Is such a number possible with only 60 trials and 26 participants per group, or is something off here? Note that I set lmerTest.limit = 6240 because otherwise the df are reported as Inf and z-tests are computed instead of t-tests. I used Satterthwaite's method because it is much faster to compute than Kenward-Roger, but both methods give me the same degrees of freedom for these contrasts.
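For reference, this is a sketch of the Kenward-Roger variant I compared against; for this method emmeans uses pbkrtest.limit rather than lmerTest.limit (6240 being the number of rows in my data):

# Kenward-Roger version of the call above (can be much slower);
# pbkrtest.limit plays the role that lmerTest.limit plays for Satterthwaite.
emmeans(LMM, pairwise ~ neuralArea | group,
        lmer.df = "kenward-roger", pbkrtest.limit = 6240)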

Also, p-value adjustment for multiple testing does not seem to work for any contrasts I computed from this model: I get the same p-values and no note about the adjustment in the output for every adjustment method (e.g. adjust = "fdr", "bonferroni", "none") that I added to either the emmeans() or the summary() call. The adjustment does work using multcomp::glht(), but that gives me only z-values instead of t-values and, again, no df at all. Or is a z-test actually more appropriate in this case?

Mainly, though, I'm interested in whether those degrees of freedom for the t-tests make sense.

Thank you very much in advance!

Best Answer

First, P-value adjustments are done separately for each "by" group, and you have only one comparison in each group; hence there is no multiplicity of tests, and no multiplicity adjustment is needed or done. If you do summary(LMM.emmeans$contrasts, by = NULL, adjust = "bonf"), you will see an adjustment for the two comparisons considered as one family of tests.
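As a runnable sketch of that suggestion:

# Pool the two anterior - posterior contrasts into one family of tests and
# Bonferroni-adjust across it (each raw p-value is doubled, capped at 1).
summary(LMM.emmeans$contrasts, by = NULL, adjust = "bonferroni")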

Second, when you have a within-subjects comparison, the subject effects cancel out because they are measured on the same subjects. That means the degrees of freedom needed to estimate the subject variation do not play a role, which makes the d.f. for the comparison much greater than the d.f. for the means themselves. It will not exceed the number of observations in the dataset, however.
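As a quick sanity check (assuming one row per subject x item x neuralArea cell, which is what lmerTest.limit = 6240 suggests):

# Upper bound on the contrast d.f.: the number of observations.
n_obs <- 52 * 60 * 2   # subjects x items x areas = 6240
n_obs > 6124           # TRUE: the reported contrast d.f. stays below the bound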
