I am studying the effects of year, topography and land use on the biodiversity of an African forest. I fit an lmer model to my data, which have the following structure:
str(BIData)
'data.frame': 90 obs. of 9 variables:
$ Year : Factor w/ 9 levels "1994","1995",..: 1 1 1 1 1 1 1 1 1 1 ...
$ Plot : Factor w/ 10 levels "Bas Kolel","Bougou",..: 1 2 3 4 5 6 7 8 9 10 ...
$ Richness : int 8 21 13 14 8 10 6 10 8 20 ...
$ Abundance : int 286 1471 1121 466 242 97 250 790 208 2015 ...
$ Shannon : num 1.33 1.79 1.55 1.68 1.44 1.71 1.35 1.27 1.27 1.86 ...
$ Simpson : num 0.656 0.71 0.682 0.694 0.665 0.714 0.66 0.647 0.649 0.718 ...
$ InverseSimpson: num 2.91 3.45 3.14 3.28 2.99 3.52 2.95 2.83 2.86 3.54 ...
$ Topography : Factor w/ 3 levels "Plateau","Slope",..: 3 1 1 3 3 2 2 2 3 1 ...
$ Land_use : Factor w/ 2 levels "Cultivated","Pasture": 1 2 2 2 1 1 2 2 1
I ran this lmer model:
model4 <- lmer(Abundance ~ Year + Topography + Land_use + Year:Topography + (1 | Plot), data = BIData)
followed by:
lsmeanmodel4 <- lsmeans(model4, "Year")
in order to calculate least-squares means, standard errors, and confidence intervals (upper/lower limits). As my experiment is balanced, I expected the LS means to agree with the arithmetic means, but I got different results. Moreover, the CIs are also huge. Can anybody give me some hints on what causes these differences and how I can overcome these problems?
Here is an example of what I obtained:
library(lsmeans)
lsmeanmodel4 <- lsmeans(model4, "Year")
print(lsmeanmodel4)
Year lsmean SE df lower.CL upper.CL
1994 713.6576 100.2217 13.51 497.9625 929.3527
1995 698.3520 100.2217 13.51 482.6570 914.0471
1996 682.7131 100.2217 13.51 467.0181 898.4082
1998 489.6020 100.2217 13.51 273.9070 705.2971
1999 334.4631 100.2217 13.51 118.7681 550.1582
2002 392.5465 100.2217 13.51 176.8514 608.2415
2004 361.3798 100.2217 13.51 145.6847 577.0749
2010 404.7687 100.2217 13.51 189.0736 620.4638
2015 318.4909 100.2217 13.51 102.7959 534.1860
Results are averaged over the levels of: Topography, Land_use
Confidence level used: 0.95
tapply(BIData$Abundance, list(BIData$Year),mean)
1994 1995 1996 1998 1999 2002 2004 2010 2015
694.6 695.0 661.9 496.5 352.7 398.8 369.7 403.0 323.6
Thank you in advance
Best Answer
Least-squares means are not means of the data values. They are marginal averages of predictions from the model, taken over a grid consisting of all factor combinations.
If your model had included all the two- and three-way interactions, then those predictions would equal the cell means, making the LS means the same as the raw marginal means, provided the data are balanced. (You can try this to verify.)
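A quick way to try this is a minimal base-R sketch. It uses `lm` rather than `lmer` and invented factors (`Year`, `Site`, simulated `y` — none of these are your data), but the same logic applies to the fixed-effects part of a mixed model: fit the full-interaction model to balanced data, average its predictions with equal weight over the factor grid, and compare with the raw means.

```r
# Balanced toy data: 2 replicates per Year x Site cell (all values simulated)
set.seed(2)
d <- expand.grid(rep = 1:2, Year = factor(1:3), Site = factor(c("A", "B")))
d$y <- rnorm(nrow(d))

# Model with ALL interactions: its fitted values are exactly the cell means
fit <- lm(y ~ Year * Site, data = d)

# "LS mean" per Year: predictions averaged over the Site grid, equal weights
grid <- expand.grid(Year = levels(d$Year), Site = levels(d$Site))
lsm <- tapply(predict(fit, newdata = grid), grid$Year, mean)

# Raw marginal means of the data
raw <- tapply(d$y, d$Year, mean)

all.equal(unname(lsm), unname(raw))  # TRUE: they agree under balance
```

Here each cell has the same number of observations, so the equal-weight grid average coincides with the frequency-weighted data average.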
However, your model excludes most of the interactions, so its predictions don't match the cell means, and the marginal averages of those predictions don't match the marginal means of the data.
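A mismatch like yours also arises whenever the reference grid is not represented equally in the data — in your case, ten plots cannot be spread evenly over the six Topography × Land_use combinations. The following sketch (again base R with `lm` and invented factors, not your data) makes one factor level twice as frequent as the other: the LS means average the grid with equal weights, while the raw means weight levels by their frequency, so the two disagree.

```r
# Unbalanced toy data: each Year has two "A" sites and one "B" site (simulated)
set.seed(1)
d <- expand.grid(Year = factor(1:3), Site = factor(c("A", "A", "B")))
d$y <- rnorm(nrow(d), mean = 10 * as.numeric(d$Year) + 5 * (d$Site == "B"))

# Additive model, i.e. no Year:Site interaction
fit <- lm(y ~ Year + Site, data = d)

# "LS mean" per Year: predictions averaged over the Site grid, EQUAL weights
grid <- expand.grid(Year = levels(d$Year), Site = levels(d$Site))
lsm <- tapply(predict(fit, newdata = grid), grid$Year, mean)

# Raw means weight Site by its frequency in the data (2:1), so they differ
raw <- tapply(d$y, d$Year, mean)
cbind(lsmean = lsm, raw = raw)
```

Neither column is "wrong": the LS means answer "what would the Year averages be if every Site were equally represented?", while the raw means describe the sample as collected.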