Least-squares means confidence intervals with Tukey adjustment balloon unrealistically wide in a mixed model

mixed-model, r

I have data from an experiment with two fully crossed treatments. Response variables were measured in two separate years. I fit a linear mixed-effects model to a log-transformed response variable, with the two main fixed effects, their interaction, and a random effect of year (using lmer() from the R package lme4). Then I used emmeans::emmeans() to get the estimated marginal means for one of the treatments, and multcomp::cld() to apply the Tukey adjustment and see whether any of the treatment levels differ. My issue is that with the log transformation, the Tukey adjustment causes the 95% confidence intervals around the treatment means to become incredibly wide, many orders of magnitude above and below any observed data value. This seems unrealistic. I am wondering whether this is an issue with my R syntax or whether I've set up the tests incorrectly.

Here is a slightly fuzzed version of the dataset for a reproducible example; since the data are private, the factor level names have been anonymized and some random noise added.

Load packages and data.

library(lme4)
library(emmeans)
library(multcomp)

dat_reprex <- structure(list(Year = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,  
 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,  
 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 
 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 
 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 
 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 
 2), Plot = c(1001, 1003, 1009, 1010, 1011, 1014, 1016, 1022, 1023, 
 1024, 1027, 1029, 1035, 1036, 1037, 1040, 1042, 1048, 1049, 1050, 
 1053, 1055, 1061, 1062, 1063, 2001, 2003, 2009, 2010, 2011, 2014, 
 2016, 2022, 2023, 2024, 2027, 2029, 2035, 2036, 2037, 2040, 2042, 
 2048, 2049, 2050, 2053, 2055, 2061, 2062, 2063, 3001, 3003, 3009, 
 3010, 3011, 3014, 3016, 3022, 3023, 3024, 3027, 3029, 3035, 3036, 
 3037, 3040, 3042, 3048, 3049, 3050, 3053, 3055, 3061, 3062, 3063, 
 1001, 1003, 1009, 1010, 1011, 1014, 1016, 1022, 1023, 1024, 1027, 
 1029, 1035, 1036, 1037, 1040, 1042, 1048, 1049, 1050, 1053, 1055, 
 1061, 1062, 1063, 2001, 2003, 2009, 2010, 2011, 2014, 2016, 2022, 
 2023, 2024, 2027, 2029, 2035, 2036, 2037, 2040, 2042, 2048, 2049, 
 2050, 2053, 2055, 2061, 2062, 2063, 3001, 3003, 3009, 3010, 3011,
 3014, 3016, 3022, 3023, 3024, 3027, 3029, 3035, 3036, 3037, 3040, 
 3042, 3048, 3049, 3050, 3053, 3055, 3061, 3062, 3063), Trt1 = 
 structure(c(3L, 3L, 4L, 1L, 2L, 3L, 3L, 4L, 1L, 2L, 3L, 3L, 4L, 1L, 
 2L, 3L, 3L, 4L, 1L, 2L, 3L, 3L, 4L, 1L, 2L, 3L, 3L, 4L, 1L, 2L, 3L, 
 3L, 4L, 1L, 2L, 3L, 3L, 4L, 1L, 2L, 3L, 3L, 4L, 1L, 2L, 3L, 3L, 4L, 
 1L, 2L, 3L, 3L, 4L, 1L, 2L, 3L, 3L, 4L, 1L, 2L, 3L, 3L, 4L, 1L, 2L, 
 3L, 3L, 4L, 1L, 2L, 3L, 3L, 4L, 1L, 2L, 3L, 3L, 4L, 1L, 2L, 3L, 3L, 
 4L, 1L, 2L, 3L, 3L, 4L, 1L, 2L, 3L, 3L, 4L, 1L, 2L, 3L, 3L, 4L, 1L, 
 2L, 3L, 3L, 4L, 1L, 2L, 3L, 3L, 4L, 1L, 2L, 3L, 3L, 4L, 1L, 2L, 3L, 
 3L, 4L, 1L, 2L, 3L, 3L, 4L, 1L, 2L, 3L, 3L, 4L, 1L, 2L, 3L, 3L, 4L, 
 1L, 2L, 3L, 3L, 4L, 1L, 2L, 3L, 3L, 4L, 1L, 2L, 3L, 3L, 4L, 1L, 2L), 
 .Label = c("control", "level1", "level2", "level3"), class = 
 "factor"), Trt2 = structure(c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L,
  3L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 
 5L, 5L, 5L, 3L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L, 2L, 2L, 2L, 2L, 
 2L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 1L, 
 1L, 1L, 1L, 1L, 4L, 4L, 4L, 4L, 4L, 5L, 5L, 5L, 5L, 5L, 1L, 1L, 1L,
 1L, 1L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L,
 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 3L, 3L, 3L, 3L, 3L, 4L, 4L,
 4L, 4L, 4L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 
 2L, 3L, 3L, 3L, 3L, 3L, 1L, 1L, 1L, 1L, 1L, 4L, 4L, 4L, 4L, 4L, 5L, 
 5L, 5L, 5L, 5L), .Label = c("control", "level1", "level2", "level3", 
 "level4"), class = "factor"), Y = c(2158.59905970595, 
 1251.4727731083, 1790.19817919615, 1597.19282719814, 1687.0106965681,
 1720.75607746878, 1564.15753862752, 1505.82507506224, 
 1607.45705437054, 1570.36888939712, 1305.95650235435, 
 1233.34290117997, 1440.36874372446, 1209.82514463767, 
 1801.6777811149, 1548.6815875682, 1808.21301292584, 1442.22647693023,
 1514.35463381467, 1658.26138464109, 1309.91739898201, 
 1710.5053649517, 1406.81202399823, 1233.84694614133, 
 1523.13259673538, 1340.63244495632, 1336.91901503359, 
 1868.83212405745, 966.156768650265, 772.39479816532, 
 838.530758532424, 1674.51520638795, 1997.39554060529, 
 1565.87718877466, 1422.75213461409, 813.229634449323, 
 1013.53380296962, 2158.55146827853, 2022.86437017822, 
 1439.1910289597,
 921.481485470946, 1171.54630364696, 2043.22666236862, 
 1597.97904932079, 1674.86250712198, 1471.74181876526, 
 1579.25893093279, 2310.6112810042, 1942.56675761732, 1738.1611751111,
 1965.92272530422, 2218.76071019398, 2069.38801363651, 
 1740.44642106525, 1479.38268171168, 1596.78675731629, 
 2087.53773652064, 2187.30823864764, 1892.06356745912, 
 1502.64027585401, 1570.65804767824, 1572.770955274, 2470.56820319357,
 2114.27120093182, 1692.52887244199, 1343.34761356524, 
 1563.32633474988, 2263.51826447082, 2088.28208660775, 
 1884.99379164318, 1682.63161008498, 1608.09824983156, 
 2005.59693333264, 1989.74208915916, 2063.48565368801, 
 4087.35614940249, 3221.39128174976, 3511.60850378984, 
 3312.70958631815, 2970.00203271205, 3924.25537269975, 
 4059.0383884004, 3489.88303736939, 3764.75925683024, 2892.7642548966,
 3749.5717873939, 3225.41521326006, 3698.37915484814, 
 2488.68187948746, 1638.01249594836, 3601.60167217308, 
 3946.39929365291, 3127.45486313203, 2540.92642036889, 
 2577.01180410805, 3895.47710383669, 3924.81526007975, 
 2863.00503003766, 2367.2490438887, 2122.59795071745, 
 4216.63325128058, 4752.55394742371, 3101.96796396627, 
 2976.86553176957, 2686.22237849168, 3387.60400274547, 
 3304.85510168601, 3486.77455705152, 3345.00280941006, 
 3391.75323528612, 3531.80514761301, 3528.52241670405, 
 3572.2370720877, 3258.38960287764, 3438.79078961425, 
 4325.86549340203, 2766.72375854368, 3140.26873373657, 
 2963.30088543439, 3589.23886653641, 3711.42299265126, 
 3411.76855792187, 2977.88964743111, 3528.89001371156, 
 3904.40755581469, 4272.61145419705, 3704.68025651859, 
 3440.39876615046, 3382.23008003269, 3694.68849971889, 
 3291.59326651953, 3387.51868191258, 3477.14997107478, 
 3425.66753942855, 3983.9954520286, 3563.393971622, 3753.18202786702,
 2892.42937739966, 3146.63511644724, 2866.05355993634, 
 3561.36950588379, 3560.02686268737, 3347.43624676822, 
 3722.73867296726, 3757.75260207119, 3750.18381722575, 
 3198.38828218515, 3024.21870303111, 3304.45898172044, 
 3141.58457659908)), row.names = c(NA, -150L), class = "data.frame")

Log-transform response, fit model, and get adjusted 95% CIs for treatment 1.

# Log-transform the response and fit the mixed model with a random year intercept
dat_reprex$logY <- log10(dat_reprex$Y)
fit <- lmer(logY ~ Trt1 * Trt2 + (1|Year), data = dat_reprex)

# Estimated marginal means for treatment 1 with Kenward-Roger d.f.
emmeans_trt1 <- emmeans(fit, specs = 'Trt1', mode = 'kenward-roger', 
    transform = 'log10', type = 'response')

# Tukey-adjusted comparisons with a compact letter display
cld_trt1 <- cld(emmeans_trt1, adjust = 'tukey', Letters = letters)

# Back-transform the confidence intervals
transform(cld_trt1, 
          lower.CL.backtransform = 10 ^ lower.CL,
          upper.CL.backtransform = 10 ^ upper.CL)

The final output shows very unrealistic bounds for the confidence intervals:

     Trt1 response        SE       df  lower.CL upper.CL .group lower.CL.backtransform upper.CL.backtransform
2  level1 3.337662 0.1615288 1.103493 0.6030081 18.47402      a               4.008742           2.978669e+18
1 control 3.354987 0.1615297 1.103493 0.6115121 18.40673      a               4.088012           2.551095e+18
3  level2 3.363528 0.1611005 1.048960 0.4493374 25.17779      a               2.814086           1.505895e+25
4  level3 3.396371 0.1615217 1.103493 0.6320819 18.24974      a               4.286294           1.777235e+18

Best Answer

For starters, there is a difference between the transform and tran arguments. If we use tran instead of transform, we get

> ### Output 1 ###

> emmeans_trt1 <- emmeans(fit, specs = 'Trt1', mode = 'kenward-roger', tran = 'log10', type = 'response')
NOTE: Results may be misleading due to involvement in interactions

> cld(emmeans_trt1, adjust = 'tukey', Letters = letters)
Note: adjust = "tukey" was changed to "sidak"
because "tukey" is only appropriate for one set of pairwise comparisons
 Trt1    response  SE   df lower.CL upper.CL .group
 level1      2177 810 1.02 5.64e-05 8.40e+10  a    
 control     2266 843 1.02 5.87e-05 8.74e+10  a    
 level2      2310 857 1.01 3.17e-05 1.68e+11  a    
 level3      2491 927 1.02 6.46e-05 9.61e+10  a    

Results are averaged over the levels of: Trt2 
Degrees-of-freedom method: kenward-roger 
Confidence level used: 0.95 
Conf-level adjustment: sidak method for 4 estimates 
Intervals are back-transformed from the log10 scale 
P value adjustment: tukey method for comparing a family of 4 estimates 
Tests are performed on the log10 scale 
significance level used: alpha = 0.05 
NOTE: Compact letter displays can be misleading
      because they show NON-findings rather than findings.
      Consider using 'pairs()', 'pwpp()', or 'pwpm()' instead. 

These (correct) results are a lot different. Also, notice from the annotations that the intervals are already back-transformed.
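
You can verify that directly. Here is a quick sketch, assuming the emmeans_trt1 object from Output 1 (the one created with tran = 'log10') is in the workspace: summarizing on the linear-predictor scale shows the log10-scale estimates, while the response-scale summary back-transforms them exactly once.

# Estimates and CIs on the log10 (linear-predictor) scale -- no back-transformation
summary(emmeans_trt1, type = "lp")

# Estimates and CIs back-transformed to the response scale (10^ applied once)
summary(emmeans_trt1, type = "response")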

What happened when you used transform is that, due to a timing issue, the initial results were treated as if they were not already on a log scale and were then converted to the log scale, giving you results on a log-log scale.

It is really more straightforward and safer to incorporate the transformation in the model itself:

> ### Output 2 ###

> fit2 <- lmer(log10(Y) ~ Trt1 * Trt2 + (1|Year), data = dat_reprex)
> emm2_trt1 <- emmeans(fit2, specs = 'Trt1', mode = 'kenward-roger', type = 'response')
NOTE: Results may be misleading due to involvement in interactions
> cld(emm2_trt1)
 Trt1    response  SE   df lower.CL upper.CL .group
 level1      2177 810 1.02     23.1   204783  1    
 control     2266 843 1.02     24.1   213127  1    
 level2      2310 857 1.01     22.2   240105  1    
 level3      2491 927 1.02     26.5   234347  1 
... etc. ...

Now, why are the CIs so wide? One reason is that with the K-R d.f. method, the d.f., as shown, are all near 1, which essentially means that the sampling distribution of the EMMs is Cauchy -- very fat tailed, so fat that the mean and variance don't even exist.
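
A quick way to see how much those near-1 d.f. cost is to compare the critical t values involved (plain base R, shown only for illustration):

# Critical values for a two-sided 95% interval
qt(0.975, df = 1.02)   # with roughly 1 d.f.: huge, Cauchy-like tails
qt(0.975, df = 129)    # with the comparison d.f.: about 1.98

# On the log10 scale the margin of error is the critical value times the SE,
# and back-transforming with 10^() turns that gap into orders of magnitude.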

But another aspect of this is that the d.f. for the comparisons are much larger:

> ### Output 3 ###

> pairs(emm2_trt1)
 contrast         ratio     SE  df null t.ratio p.value
 control / level1 1.041 0.0559 129    1   0.744  0.8793
 control / level2 0.981 0.0456 129    1  -0.415  0.9758
 control / level3 0.909 0.0489 129    1  -1.767  0.2939
 level1 / level2  0.942 0.0438 129    1  -1.274  0.5812
 level1 / level3  0.874 0.0469 129    1  -2.511  0.0630
 level2 / level3  0.927 0.0431 129    1  -1.625  0.3682

Results are averaged over the levels of: Trt2 
Degrees-of-freedom method: kenward-roger 
P value adjustment: tukey method for comparing a family of 4 estimates 
Tests are performed on the log10 scale 

We have 129 d.f. for each comparison. If you prefer to see pairwise differences rather than ratios, you have to re-grid the results to the response scale. This uses the original d.f. of the means rather than those of the comparisons, so we have to override the d.f.:

> ### Output 4 ###

> pairs(regrid(emm2_trt1), df = 129, infer = TRUE)
 contrast         estimate  SE  df lower.CL upper.CL t.ratio p.value
 control - level1     88.7 124 129     -233      411   0.717  0.8903
 control - level2    -44.2 107 129     -324      235  -0.412  0.9764
 control - level3   -225.6 153 129     -623      172  -1.477  0.4543
 level1 - level2    -132.9 114 129     -431      165  -1.161  0.6524
 level1 - level3    -314.3 171 129     -760      131  -1.836  0.2614
 level2 - level3    -181.4 132 129     -524      161  -1.379  0.5149

Results are averaged over the levels of: Trt2 
Degrees-of-freedom method: user-specified 
Confidence level used: 0.95 
Conf-level adjustment: tukey method for comparing a family of 4 estimates 
P value adjustment: tukey method for comparing a family of 4 estimates

These intervals are reasonable-looking.
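
If ratios are acceptable to report, you can also get Tukey-adjusted confidence intervals for the pairwise ratios directly, without re-gridding; a sketch using emm2_trt1 from above:

# Tukey-adjusted 95% CIs for the pairwise ratios shown in Output 3; these use
# the ~129 comparison d.f., so they stay narrow
confint(pairs(emm2_trt1), adjust = "tukey")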

One additional reason that compact letter displays are so misleading is that they don't show important information about the comparisons, such as the fact that the d.f. for the comparisons are dramatically different from those of the means.
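
As the note at the end of Output 1 suggests, a pairwise comparison matrix is one alternative that keeps that information visible; a sketch, again using emm2_trt1:

# Matrix display with the EMMs on the diagonal, Tukey-adjusted P values in one
# triangle, and the pairwise comparisons (ratios, for this log10 model) in the other
pwpm(emm2_trt1, adjust = "tukey")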

Addendum

Later, I noticed that Outputs 1 and 2 have the same estimates but different intervals, and the intervals in Output 1 are very much wider. That's because of the adjust argument, which was applied to both the pairwise comparisons and the confidence intervals. For the CIs, the Tukey method is not appropriate because the means themselves aren't pairwise comparisons, so the adjustment method was changed to "sidak", as noted before the output. You can confirm this: if you simply list emmeans_trt1, you get the same intervals as in Output 2, and if you do confint(emmeans_trt1, adjust = "sidak"), you get the intervals in Output 1. The Sidak-adjusted intervals are a whole lot wider because we are reaching for more extreme quantiles of the t distribution (0.0063 and 0.9937 instead of 0.025 and 0.975) with about 1 d.f., and as noted earlier, the tails of that distribution are extremely fat.
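
To make that check concrete, here is a short sketch using the emmeans_trt1 object from Output 1:

# Per-mean 95% CIs with no family adjustment -- these should match Output 2
confint(emmeans_trt1)

# Sidak-adjusted CIs for the family of 4 means -- these should match the much
# wider intervals in Output 1
confint(emmeans_trt1, adjust = "sidak")

# The Sidak adjustment reaches for more extreme t quantiles:
alpha_sidak <- 1 - (1 - 0.05)^(1/4)   # familywise alpha spread over 4 estimates
alpha_sidak / 2                       # roughly 0.0063 in each tail, as noted above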
