I was working through the example here on calculating least-squares means with doBy. To make sure I understand what is going on, I would like to replicate the analysis manually. I can do it when the numbers of untreated and treated observations are the same, but I fail when they are not.
I am copying the example here:
require(doBy)
zz <- "treat year y
1 t1 1 0.5
2 t1 1 1.0
3 t1 1 1.5
4 t2 1 3.0
5 t1 2 3.0
6 t2 2 4.5
7 t2 2 5.0
8 t2 2 5.5"
simdat <- read.table(text = zz, header =TRUE)
msim <- lm(y ~ treat + year, data=simdat)
LSmeans( msim, effect="treat")
It gives the following:
  estimate        se df    t.stat      p.value treat year
1        2 0.2415229  5  8.280787 4.191542e-04    t1  1.5
2        4 0.2415229  5 16.561573 1.465478e-05    t2  1.5
The estimates can be calculated manually as follows:
t1_y1 <- mean(simdat$y[simdat$treat == "t1" & simdat$year == 1])
t1_y2 <- mean(simdat$y[simdat$treat == "t1" & simdat$year == 2])
(t1_y1 + t1_y2)/2
[1] 2
t2_y1 <- mean(simdat$y[simdat$treat == "t2" & simdat$year == 1])
t2_y2 <- mean(simdat$y[simdat$treat == "t2" & simdat$year == 2])
(t2_y1 + t2_y2)/2
[1] 4
However, when I make it unbalanced, I don't know how to calculate the estimates. For example, as below:
zz <- "treat year y
1 t1 1 0.5
2 t1 1 1.0
3 t1 1 1.5
4 t2 1 3.0
5 t1 2 3.0
6 t2 2 4.5
7 t2 2 5.0
8 t1 2 5.5"
simdat <- read.table(text = zz, header =TRUE)
msim <- lm(y ~ treat + year, data=simdat)
LSmeans( msim, effect="treat")
  estimate        se df   t.stat     p.value treat year
1 2.571429 0.4400255  5 5.843817 0.002076749    t1  1.5
2 3.714286 0.5729889  5 6.482299 0.001302694    t2  1.5
#unweighted means
t1_y1 <- mean(simdat$y[simdat$treat == "t1" & simdat$year == 1])
t1_y2 <- mean(simdat$y[simdat$treat == "t1" & simdat$year == 2])
(t1_y1 + t1_y2)/2
[1] 2.625
t2_y1 <- mean(simdat$y[simdat$treat == "t2" & simdat$year == 1])
t2_y2 <- mean(simdat$y[simdat$treat == "t2" & simdat$year == 2])
(t2_y1 + t2_y2)/2
[1] 3.875
#maybe weighted means?
t1_y1 <- mean(simdat$y[simdat$treat == "t1" & simdat$year == 1])
t1_y2 <- mean(simdat$y[simdat$treat == "t1" & simdat$year == 2])
(t1_y1*(5/8) + t1_y2 * (3/8))
[1] 2.21875
t2_y1 <- mean(simdat$y[simdat$treat == "t2" & simdat$year == 1])
t2_y2 <- mean(simdat$y[simdat$treat == "t2" & simdat$year == 2])
(t2_y1*(5/8) + t2_y2 * (3/8))
[1] 3.65625
How can I manually calculate the least-squares means?
Best Answer
The issue, which I overlooked initially, is that you fitted an additive model (no interaction).
year as a factor
I will illustrate, but for reasons to be explained later, I'm using fyear, a factor version of year.
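A minimal sketch of that setup, assuming the unbalanced data from the question. The name msimi for the interaction model follows the text; grid, cell_means and lsm_msimi are illustrative names, and the calls are a base-R reconstruction rather than the original code.

```r
# Unbalanced data from the question
simdat <- data.frame(
  treat = factor(c("t1", "t1", "t1", "t2", "t1", "t2", "t2", "t1")),
  year  = c(1, 1, 1, 1, 2, 2, 2, 2),
  y     = c(0.5, 1.0, 1.5, 3.0, 3.0, 4.5, 5.0, 5.5)
)
simdat$fyear <- factor(simdat$year)  # factor version of year

# Model with the treat-by-fyear interaction
msimi <- lm(y ~ treat * fyear, data = simdat)

# Every combination of treat and fyear
grid <- expand.grid(treat = levels(simdat$treat),
                    fyear = levels(simdat$fyear))

# Predictions from the interaction model, next to the observed cell means
grid$pred_msimi <- predict(msimi, newdata = grid)
cell_means <- aggregate(y ~ treat + fyear, data = simdat, FUN = mean)
print(merge(grid, cell_means))

# Equal-weight average over fyear within each treatment
lsm_msimi <- tapply(grid$pred_msimi, grid$treat, mean)
print(lsm_msimi)  # about 2.625 for t1 and 3.875 for t2
```

The averages agree with the unweighted means computed in the question, which is what one would expect when the model reproduces each cell mean.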
Your calculations are thus correct for the interaction model msimi. That's because the fitted values for that model are the cell means. Least-squares means are averages of predictions, equally weighted. Here they are obtained by hand for the additive model, msim.
First, obtain the predictions over the four combinations of treat and fyear; then average those predictions within each level of treat.
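The two steps can be sketched as follows, assuming msim is the additive model y ~ treat + fyear fitted to the unbalanced data from the question. The names grid and lsm are illustrative, and only base R is used.

```r
# Unbalanced data from the question
simdat <- data.frame(
  treat = factor(c("t1", "t1", "t1", "t2", "t1", "t2", "t2", "t1")),
  year  = c(1, 1, 1, 1, 2, 2, 2, 2),
  y     = c(0.5, 1.0, 1.5, 3.0, 3.0, 4.5, 5.0, 5.5)
)
simdat$fyear <- factor(simdat$year)

# Additive model (no interaction), with year as a factor
msim <- lm(y ~ treat + fyear, data = simdat)

# Step 1: predictions for every combination of treat and fyear
grid <- expand.grid(treat = levels(simdat$treat),
                    fyear = levels(simdat$fyear))
grid$pred <- predict(msim, newdata = grid)
print(grid)

# Step 2: equal-weight average of the predictions within each treatment
lsm <- tapply(grid$pred, grid$treat, mean)
print(lsm)  # about 2.571 for t1 and 3.714 for t2
```

Because the additive model does not reproduce the cell means when the design is unbalanced, these averages differ from the averages of the raw cell means.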
These are the same as the results that LSmeans obtained.
year as numeric
There is a subtle difference when we use the model with year as a numeric predictor. Least-squares means are obtained from a "reference grid" defined by the model. The lsmeans package allows obtaining that reference grid explicitly.
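In lsmeans, the function for this is ref.grid(). As a sketch that needs only base R, equivalent grids can be built by hand. Here msimq is assumed to be the additive model with numeric year, and grid_f, grid_q, lsm_f and lsm_q are illustrative names.

```r
# Unbalanced data from the question
simdat <- data.frame(
  treat = factor(c("t1", "t1", "t1", "t2", "t1", "t2", "t2", "t1")),
  year  = c(1, 1, 1, 1, 2, 2, 2, 2),
  y     = c(0.5, 1.0, 1.5, 3.0, 3.0, 4.5, 5.0, 5.5)
)
simdat$fyear <- factor(simdat$year)

msim  <- lm(y ~ treat + fyear, data = simdat)  # year as a factor
msimq <- lm(y ~ treat + year,  data = simdat)  # year as numeric

# Grid for msim: all combinations of the two factors (4 rows)
grid_f <- expand.grid(treat = levels(simdat$treat),
                      fyear = levels(simdat$fyear))

# Grid for msimq: each treatment with year fixed at its mean (2 rows)
grid_q <- data.frame(treat = factor(levels(simdat$treat)),
                     year  = mean(simdat$year))
print(grid_f)
print(grid_q)

# Least-squares means from each grid
lsm_f <- tapply(predict(msim, newdata = grid_f), grid_f$treat, mean)
lsm_q <- setNames(predict(msimq, newdata = grid_q),
                  as.character(grid_q$treat))
print(lsm_f)
print(lsm_q)
```

With only two years, the factor and numeric codings describe the same fitted model, so the two sets of means should agree up to rounding.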
The reference grid for msim consists of the four combinations of the two factors, whereas the reference grid for msimq has only two points, with year set to its average. The least-squares means for treat are the same for both models, because the linear effect of year is used, which implies that the mean at the average is the average of the means of years 1 and 2.
Summary
To understand least-squares means correctly, focus on the fact that they are based on predictions from a model -- not directly on data without a model context.
You might want to take a look at the documentation and vignettes in the lsmeans package, which has more comprehensive support for obtaining least-squares means from various models.