Solved – Manually Calculating Least Square Means

least squares, r, regression

I was working through the example here, which calculates least-squares means with doBy. To make sure I understand what is going on, I would like to replicate the analysis manually. I can do it when the number of untreated and treated observations is the same, but not when the numbers differ.

I am copying the example here:

require(doBy)
zz <- "treat year y
1 t1 1 0.5
2 t1 1 1.0
3 t1 1 1.5
4 t2 1 3.0
5 t1 2 3.0
6 t2 2 4.5
7 t2 2 5.0
8 t2 2 5.5"

simdat <- read.table(text = zz, header = TRUE)
msim <- lm(y ~ treat + year, data = simdat)
LSmeans(msim, effect = "treat")

It gives the following:

  estimate        se df    t.stat      p.value treat year
1        2 0.2415229  5  8.280787 4.191542e-04    t1  1.5
2        4 0.2415229  5 16.561573 1.465478e-05    t2  1.5

The estimates can be manually calculated as follows:

t1_y1 <- mean(simdat$y[simdat$treat == "t1" & simdat$year == 1])
t1_y2 <- mean(simdat$y[simdat$treat == "t1" & simdat$year == 2])
(t1_y1 + t1_y2)/2
[1] 2

t2_y1 <- mean(simdat$y[simdat$treat == "t2" & simdat$year == 1])
t2_y2 <- mean(simdat$y[simdat$treat == "t2" & simdat$year == 2])
(t2_y1 + t2_y2)/2
[1] 4

However, when I make the data unbalanced, I don't know how to calculate the estimates. For example:

zz <- "treat year y
1 t1 1 0.5
2 t1 1 1.0
3 t1 1 1.5
4 t2 1 3.0
5 t1 2 3.0
6 t2 2 4.5
7 t2 2 5.0
8 t1 2 5.5"

simdat <- read.table(text = zz, header = TRUE)
msim <- lm(y ~ treat + year, data = simdat)
LSmeans(msim, effect = "treat")
  estimate        se df   t.stat     p.value treat year
1 2.571429 0.4400255  5 5.843817 0.002076749    t1  1.5
2 3.714286 0.5729889  5 6.482299 0.001302694    t2  1.5

#unweighted means
t1_y1 <- mean(simdat$y[simdat$treat == "t1" & simdat$year == 1])
t1_y2 <- mean(simdat$y[simdat$treat == "t1" & simdat$year == 2])
(t1_y1 + t1_y2)/2
[1] 2.625

t2_y1 <- mean(simdat$y[simdat$treat == "t2" & simdat$year == 1])
t2_y2 <- mean(simdat$y[simdat$treat == "t2" & simdat$year == 2])
(t2_y1 + t2_y2)/2
[1] 3.875

#maybe weighted means?
t1_y1 <- mean(simdat$y[simdat$treat == "t1" & simdat$year == 1])
t1_y2 <- mean(simdat$y[simdat$treat == "t1" & simdat$year == 2])
(t1_y1*(5/8) + t1_y2 * (3/8))
[1] 2.21875

t2_y1 <- mean(simdat$y[simdat$treat == "t2" & simdat$year == 1])
t2_y2 <- mean(simdat$y[simdat$treat == "t2" & simdat$year == 2])
(t2_y1*(5/8) + t2_y2 * (3/8))
[1] 3.65625

How can I manually calculate the least square means?

Best Answer

The issue, which I overlooked initially, is that you fitted an additive model (no interaction).

year as a factor

I will illustrate, but for reasons to be explained later, I'm using fyear, a factor version of year:

> simdat = transform(simdat, fyear = factor(year))

> msim <- lm(y ~ treat + fyear, data = simdat)
> doBy::LSmeans(msim, "treat")
  estimate        se df   t.stat     p.value treat
1 2.571429 0.4400255  5 5.843817 0.002076749    t1
2 3.714286 0.5729889  5 6.482299 0.001302694    t2

> msimi <- lm(y ~ treat * fyear, data = simdat)
> doBy::LSmeans(msimi, "treat")
  estimate        se df   t.stat     p.value treat
1    2.625 0.4419417  4 5.939697 0.004028851    t1
2    3.875 0.5929271  4 6.535374 0.002832366    t2

Your calculations are thus correct for the interaction model msimi. That's because the fitted values for that model are the cell means. Least-squares means are equally weighted averages of model predictions, taken over the levels of the other predictors. Here they are obtained by hand for the additive model, msim:

First, obtain the predictions:

> grid = expand.grid(treat = c("t1","t2"), fyear = factor(c(1,2)))
> pred = predict(msim, newdata = grid)
> cbind(grid, pred=pred)
  treat fyear     pred
1    t1     1 1.214286
2    t2     1 2.357143
3    t1     2 3.928571
4    t2     2 5.071429

Now, average those predictions:

> (pred[1] + pred[3])/2
       1 
2.571429 

> (pred[2] + pred[4])/2
       2 
3.714286

These are the same as the results that LSmeans obtained.
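The two steps above can be wrapped into a small helper. This is only a minimal sketch in base R, not what doBy does internally; the helper name lsm_by_hand is made up for illustration. It rebuilds the unbalanced data, predicts on every treat x fyear combination, and averages the predictions within each treatment, for both the additive and the interaction model.

```r
# Unbalanced data from the question, with year also stored as a factor
zz <- "treat year y
1 t1 1 0.5
2 t1 1 1.0
3 t1 1 1.5
4 t2 1 3.0
5 t1 2 3.0
6 t2 2 4.5
7 t2 2 5.0
8 t1 2 5.5"
simdat <- read.table(text = zz, header = TRUE)
simdat$fyear <- factor(simdat$year)

# Hypothetical helper: predict on the full treat x fyear grid, then take
# the equally weighted mean of the predictions for each treatment.
lsm_by_hand <- function(model) {
  grid <- expand.grid(treat = c("t1", "t2"), fyear = c("1", "2"))
  grid$pred <- predict(model, newdata = grid)
  tapply(grid$pred, grid$treat, mean)
}

lsm_add <- lsm_by_hand(lm(y ~ treat + fyear, data = simdat))  # additive model
lsm_int <- lsm_by_hand(lm(y ~ treat * fyear, data = simdat))  # interaction model

print(lsm_add)
print(lsm_int)
```

Under these assumptions, lsm_add should agree with the additive-model estimates (about 2.571 and 3.714) and lsm_int with the interaction-model estimates (2.625 and 3.875), which are the unweighted averages of the cell means.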

year as numeric

There is a subtle difference when we use the model with year as a numeric predictor:

> msimq = lm(y ~ treat + year, data = simdat)

Least-squares means are obtained from a "reference grid" defined by the model. The lsmeans package allows obtaining that reference grid explicitly:

> library(lsmeans)

> summary(ref.grid(msim))
 treat fyear prediction        SE df
 t1    1       1.214286 0.5190258  5
 t2    1       2.357143 0.7340133  5
 t1    2       3.928571 0.6086117  5
 t2    2       5.071429 0.6086117  5

> summary(ref.grid(msimq))
 treat year prediction        SE df
 t1     1.5   2.571429 0.4400255  5
 t2     1.5   3.714286 0.5729889  5

As you see, the reference grid for msim consists of the four combinations of the two factors, whereas the reference grid for msimq has only two points, with year set to its average, 1.5. The least-squares means for treat are the same for both models because year enters msimq linearly: the prediction at the average year is the average of the predictions at years 1 and 2.
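The linearity argument can be checked directly with base R, without lsmeans. The following is a sketch under the assumption that the unbalanced data from the question are used; the object names at_mean and avg_years are mine.

```r
# Unbalanced data from the question
zz <- "treat year y
1 t1 1 0.5
2 t1 1 1.0
3 t1 1 1.5
4 t2 1 3.0
5 t1 2 3.0
6 t2 2 4.5
7 t2 2 5.0
8 t1 2 5.5"
simdat <- read.table(text = zz, header = TRUE)
msimq <- lm(y ~ treat + year, data = simdat)

# Predictions at the average year (the two-point reference grid)
at_mean <- predict(
  msimq,
  newdata = data.frame(treat = c("t1", "t2"), year = mean(simdat$year))
)

# Predictions at years 1 and 2 separately, then averaged per treatment
grid <- expand.grid(treat = c("t1", "t2"), year = c(1, 2))
grid$pred <- predict(msimq, newdata = grid)
avg_years <- tapply(grid$pred, grid$treat, mean)

print(cbind(at_mean = as.numeric(at_mean), avg_years = as.numeric(avg_years)))
```

The two columns should coincide. Note that the agreement relies on mean(simdat$year) being exactly 1.5, which holds here because each year has four observations.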

Summary

To understand least-squares means correctly, focus on the fact that they are based on predictions from a model -- not directly on data without a model context.
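Because the predictions are linear in the model coefficients, each least-squares mean can also be written as a linear combination of the coefficients, and its standard error then follows from the coefficient covariance matrix. Below is a sketch in base R for the additive model with fyear; the names X and K are mine, and this is not necessarily how doBy or lsmeans compute it internally.

```r
# Unbalanced data from the question, with year stored as a factor
zz <- "treat year y
1 t1 1 0.5
2 t1 1 1.0
3 t1 1 1.5
4 t2 1 3.0
5 t1 2 3.0
6 t2 2 4.5
7 t2 2 5.0
8 t1 2 5.5"
simdat <- read.table(text = zz, header = TRUE)
simdat$fyear <- factor(simdat$year)
msim <- lm(y ~ treat + fyear, data = simdat)

# Design matrix rows for the four points of the reference grid
grid <- expand.grid(treat = c("t1", "t2"), fyear = c("1", "2"))
X <- model.matrix(~ treat + fyear, data = grid)

# One row per treatment: the average of that treatment's design rows
K <- rbind(
  t1 = colMeans(X[grid$treat == "t1", , drop = FALSE]),
  t2 = colMeans(X[grid$treat == "t2", , drop = FALSE])
)

est <- drop(K %*% coef(msim))                  # least-squares means
se  <- sqrt(diag(K %*% vcov(msim) %*% t(K)))   # their standard errors

print(cbind(estimate = est, se = se))
```

If the sketch is right, the estimate and se columns should reproduce the LSmeans output for the additive model shown earlier.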

You might want to take a look at the documentation and vignettes in the lsmeans package (since superseded by the emmeans package), which has more comprehensive support for obtaining least-squares means from various models.
