I have a mixed model with two fixed factors (Insecte and temperature) with two random factors (bloc and date) nested in the temperature treatment. I also include interactions between the fixed factor Insecte and the random factors. The response variable is aphid density. This give the following model:
m2.nlmer = lmer(aphid_density~ Insecte*amp_freq
+(1|bloc:(date:amp_freq))+(1|date:amp_freq)+(1|Insecte:(bloc:(date:amp_freq)))+(1|Insecte:(date:amp_freq)),
na.action=na.exclude,
data = fr2)
where amp_freq are the temperature treatments. I detected significant interaction between Insecte and amp_freq and want to analyse Insecte effect for each temperature levels as follows:
> lsmeans(m2.nlmer, pairwise ~ Insecte|amp_freq, adjustSigma = TRUE, adjust = "tukey")
$lsmeans
amp_freq = 23-const:
Insecte lsmean SE df lower.CL upper.CL
COCCINELLE 568.17947 160.6389 5.39 164.0381 972.3208
PUCERON 1783.53300 160.0224 5.32 1379.5075 2187.5585
amp_freq = 30-jour:
Insecte lsmean SE df lower.CL upper.CL
COCCINELLE 570.64910 161.6708 5.48 165.8695 975.4287
PUCERON 1314.53794 160.0629 5.32 910.4669 1718.6089
amp_freq = 30-semaine:
Insecte lsmean SE df lower.CL upper.CL
COCCINELLE 738.12107 163.2045 5.74 334.3272 1141.9149
PUCERON 1867.47572 160.0204 5.32 1463.4669 2271.4845
amp_freq = 40-jour:
Insecte lsmean SE df lower.CL upper.CL
COCCINELLE 84.87312 162.5806 5.64 -319.1406 488.8869
PUCERON 256.55122 159.6785 5.27 -147.5851 660.6875
amp_freq = 40-semaine:
Insecte lsmean SE df lower.CL upper.CL
COCCINELLE 296.63947 159.6828 5.27 -107.5317 700.8107
PUCERON 837.65225 160.0241 5.32 433.6129 1241.6916
Confidence level used: 0.95
$contrasts
amp_freq = 23-const:
contrast estimate SE df t.ratio p.value
COCCINELLE - PUCERON -1215.3535 61.59491 4.44 -19.731 <.0001
amp_freq = 30-jour:
contrast estimate SE df t.ratio p.value
COCCINELLE - PUCERON -743.8888 64.11036 4.99 -11.603 0.0001
amp_freq = 30-semaine:
contrast estimate SE df t.ratio p.value
COCCINELLE - PUCERON -1129.3546 68.03274 6.60 -16.600 <.0001
amp_freq = 40-jour:
contrast estimate SE df t.ratio p.value
COCCINELLE - PUCERON -171.6781 65.61395 5.69 -2.616 0.0418
amp_freq = 40-semaine:
contrast estimate SE df t.ratio p.value
COCCINELLE - PUCERON -541.0128 58.89646 3.77 -9.186 0.0010
However, I get a different results if I split the data set by temperature level and run again the model. For instance:
m40_jour = lmer(aphid_density~ Insecte+(1|bloc:(date:amp_freq))+(1|date:amp_freq)+(1|Insecte:(bloc:(date:amp_freq)))+(1|Insecte:(date:amp_freq)),
na.action=na.exclude,
data = fr2[which(fr2$amp_freq == "40-jour"),])
lsmeans(m40_jour, pairwise ~ Insecte, adjustSigma = TRUE, adjust = "tukey")
$lsmeans
Insecte lsmean SE df lower.CL upper.CL
COCCINELLE 81.90721 29.60546 2.37 -28.08425 191.8987
PUCERON 258.15338 25.84196 1.58 113.00870 403.2981
Confidence level used: 0.95
$contrasts
contrast estimate SE df t.ratio p.value
COCCINELLE - PUCERON -176.2462 34.73064 1.02 -5.075 0.1202
Finally, I get a different result if I use the glht function with mcp:
summary(glht(m40_jour, mcp(Insecte="Tukey")))
Simultaneous Tests for General Linear Hypotheses
Multiple Comparisons of Means: Tukey Contrasts
Fit: lmer(formula = aphid_density ~ Insecte + (1 | bloc:(date:amp_freq)) +
(1 | date:amp_freq) + (1 | Insecte:(bloc:(date:amp_freq))) +
(1 | Insecte:(date:amp_freq)), data = fr2[which(fr2$amp_freq ==
"40-jour"), ], na.action = na.exclude)
Linear Hypotheses:
Estimate Std. Error z value Pr(>|z|)
PUCERON - COCCINELLE == 0 176.25 33.43 5.273 1.34e-07 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Adjusted p values reported -- single-step method)
So I do not understand why (1) I get different results between lsmeans with the interaction between fixed factor and lsmeans with the split dataset and model m40-day and (2) lsmeans and glht produce different results?
What is the correct/more robust approach?
Thank you for your help!
Best Answer
This is apparently a messy dataset (as opposed to a nice balance experiment). When you fit a model to only a subset, only the data in that subset are available to obtain estimates. But there may be partial information about those effects in the other parts of the dataset. Also, in a mixed model with unbalanced data, the random effects can affect the fixed-effect estimates, and , that can cause a discrepancy between the full-data and the subset models as well.
Which is better? Well, that's not cut-and-dried. How well do you believe the underlying assumptions for the fuill model? They include the idea that the variances of each random effect is the same for each level of
amp_freq
. If that's believable, then the full-data model is preferable because you can pool the information about those effects from the whole dataset, making the tests more powerful and the confidence intervals shorter. But if you doubt those assumptions, then separate models may be better, or a more sophisticated model that allows for different variances acrossamp_freq
.For the 2nd part of your question, the main reason is degrees of freedom. The test from
lsmeans
uses 1.02 df and the one fromglht
uses infinite df (asymptotic z test). That makes a huge difference in the P value. Also, while the estimates match, the SEs differ slightly. That's becauselsmeans
uses an adjusted covariance matrix from the pbkrtest package. That package is also used to get the degrees of freedom. So, if you do lsm.options(disable.pbkrtest = TRUE), thelsmeans
results will match those fromglht
. But I suggest that isn't a good idea, because that 1.02 df is a warning that you really don't have a lot of information about your estimate of that difference.