Solved – Least Square Means vs arithmetic means in linear mixed models

lsmeansmixed modelr

I am studying the effects of year, topography and land use on biodiversity of an African forest. I run lmer model for my data which have the following structure:

str(BIData)
data.frame':   90 obs. of  9 variables:
 $ Year          : Factor w/ 9 levels "1994","1995",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ Plot          : Factor w/ 10 levels "Bas Kolel","Bougou",..: 1 2 3 4 5 6 7 8 9 10 ...
 $ Richness      : int  8 21 13 14 8 10 6 10 8 20 ...
 $ Abundance     : int  286 1471 1121 466 242 97 250 790 208 2015 ...
 $ Shannon       : num  1.33 1.79 1.55 1.68 1.44 1.71 1.35 1.27 1.27 1.86 ...
 $ Simpson       : num  0.656 0.71 0.682 0.694 0.665 0.714 0.66 0.647 0.649 0.718 ...
 $ InverseSimpson: num  2.91 3.45 3.14 3.28 2.99 3.52 2.95 2.83 2.86 3.54 ...
 $ Topography    : Factor w/ 3 levels "Plateau","Slope",..: 3 1 1 3 3 2 2 2 3 1 ...
 $ Land_use      : Factor w/ 2 levels "Cultivated","Pasture": 1 2 2 2 1 1 2 2 1

I run this lmer model:

model4=lmer(Abundance~Year+Topography+Land_use+ Year:Topography+(1|Plot), data=BIData)

followed by:

lsmeanmodel4=lsmeans(model4, "Year") 

in order to calculate Least Square Means, Standard Error, and C.I (upper/lower limits). As my experiment is balanced I expect that LS Means agree with the arethmetic Means but I got different result. Moreover, the C.I is also huge. Can anybody give me some hints on what causes this differences and how I overcome this problems?

Here is an example of what I obtained:

library(lsmeans)
lsmeanmodel4=lsmeans(model4, "Year")
print(lsmeanmodel4)
Year   lsmean       SE    df lower.CL upper.CL
 1994 713.6576 100.2217 13.51 497.9625 929.3527
 1995 698.3520 100.2217 13.51 482.6570 914.0471
 1996 682.7131 100.2217 13.51 467.0181 898.4082
 1998 489.6020 100.2217 13.51 273.9070 705.2971
 1999 334.4631 100.2217 13.51 118.7681 550.1582
 2002 392.5465 100.2217 13.51 176.8514 608.2415
 2004 361.3798 100.2217 13.51 145.6847 577.0749
 2010 404.7687 100.2217 13.51 189.0736 620.4638
 2015 318.4909 100.2217 13.51 102.7959 534.1860

Results are averaged over the levels of: Topography, Land_use 
Confidence level used: 0.95

tapply(BIData$Abundance, list(BIData$Year),mean)
 1994  1995  1996  1998  1999  2002  2004  2010  2015 
694.6 695.0 661.9 496.5 352.7 398.8 369.7 403.0 323.6

Thank you in advance

Best Answer

Least-squares means are not means of the data values. They are marginal averages of predictions from the model, taken over a grid consisting of all factor combinations.

If your model had included all the two- and three-way interactions, then those predictions would be the same as the cell means, making the LS means the same as the raw means, provided the data are balanced. (You can try this to verify).

However, your model excludes most of the interactions, so its predictions don't match the cell means, and the marginal averages of the predictions don't match the marginal means.