Solved – What’s difference between SE and lsmean’s SE

lsmeanssample-sizestandard error

Here's a link of the past question.

Although I'm so beginner of statistic, I keep trying to understand statistic.

The past question is about significantly unequal sample size.

After that, I would like to visualize the mean with standard error to show no statistical difference among the groups. However, I found the standard error in the lsmeans's results table is different from the manually calculated standard error. I assume this is because of unequal sample size, but it is still unclear.

I might not fully understand behind of lsmeans processes though.
Could anyone let me know why these are different? Or give me information about that?

Any advice, or criticism for my question are welcome!
Thanks!

Below is my code.

three groups having extremely unequal sample size: a – 6, b – 30, c – 6

a <- data.frame("total" = c(180.3946, 184.5053, 174.7285, 176.7839, 168.2292, 171.951), "cond" = "a")

b <- data.frame("total" = c(183.4105,186.4333,178.9715,246.7047,231.7752,169.827,152.21,179.58,133.12,115.18,195.45,102.07,198.0954,242.6217,283.9676,388.9224,236.2608,210.8172,367.2511,374.014,366.124,367.2511,465.7633,396.5568,173.8551,101.9857,156.1761,171.3417,248.2407,206.0161),
"cond" = "b")

c <- data.frame("total" = c(291.6284,280.7974,212.986,271.6146,276.5592,232.7643), "cond" = "c")

combine into one data.frame

total.data <- rbind(a,b,c)

run model to see the effect

model <- lm(total ~ cond, total.data)

load library to do pairwise comparison

library(lsmeans)

lsmeans(model, pairwise~cond)

function for standard error

st.err <- function(x) {
sd(x)/sqrt(length(x))
}

calculate standard error using tapply

with(total.data,tapply(total, cond, st.err))

comparing between two SEs

lsmeans(model, pairwise~cond) [1]

$lsmeans

cond lsmean SE df lower.CL upper.CL

a 176.0988 34.77988 39 105.7498 246.4477

b 234.3331 15.55403 39 202.8721 265.7941

c 261.0583 34.77988 39 190.7094 331.4073

Confidence level used: 0.95

with(total.data,tapply(total, cond, st.err))

a b c

2.384709 17.878830 12.632462

Best Answer

The distinction is that you fitted a model that pools the SDs together into one common value. To show more precisely, first let's get some stats:

> ( n <- with(total.data, tapply(total, cond, length)) )
 a  b  c 
 6 30  6 

> ( s <- with(total.data, tapply(total, cond, sd)) )
       a        b        c 
 5.84132 97.92638 30.94309

Note that the sample SDs are quite different from one another, which is part of the reason for your confusion.

Now, the standard errors you calculated are based on the individual SDs:

> s / sqrt(n)
        a         b         c 
 2.384709 17.878830 12.632462

But the ones that lsmeans outputs are based on one SD (the pooled SD, which is obtained from a weighted average of the variances):

> ( s.p <- sqrt(sum((n - 1) * s^2) / sum(n - 1)) )
[1] 85.19295

> s.p / sqrt(n)
       a        b        c 
34.77988 15.55403 34.77988

The latter results match those from the lsmeans output. Note that the largest s corresponds to the largest n, giving it the most weight in calculating s.p. This makes the pooled SD quite a bit larger than the sample SDs for conditions "a" and "c".