Here's a link to my previous question.
Although I'm a beginner in statistics, I keep trying to understand it.
The previous question was about substantially unequal sample sizes.
After that, I wanted to visualize the means with standard errors to show that there is no statistical difference among the groups. However, I found that the standard errors in the lsmeans results table differ from the standard errors I calculated manually. I assume this is because of the unequal sample sizes, but I'm still not sure.
I may not fully understand what lsmeans does behind the scenes, though.
Could anyone explain why these are different, or point me to information about it?
Any advice or criticism of my question is welcome!
Thanks!
Below is my code.
# three groups with extremely unequal sample sizes: a = 6, b = 30, c = 6
a <- data.frame("total" = c(180.3946, 184.5053, 174.7285, 176.7839, 168.2292, 171.951), "cond" = "a")
b <- data.frame("total" = c(183.4105,186.4333,178.9715,246.7047,231.7752,169.827,152.21,179.58,133.12,115.18,195.45,102.07,198.0954,242.6217,283.9676,388.9224,236.2608,210.8172,367.2511,374.014,366.124,367.2511,465.7633,396.5568,173.8551,101.9857,156.1761,171.3417,248.2407,206.0161),
"cond" = "b")
c <- data.frame("total" = c(291.6284,280.7974,212.986,271.6146,276.5592,232.7643), "cond" = "c")
# combine into one data.frame
total.data <- rbind(a,b,c)
# run the model to see the effect
model <- lm(total ~ cond, total.data)
# load the library for pairwise comparisons
library(lsmeans)
lsmeans(model, pairwise~cond)
# function for the standard error of the mean
st.err <- function(x) {
  sd(x) / sqrt(length(x))
}
# calculate the standard error per group using tapply
with(total.data,tapply(total, cond, st.err))
# comparing the two sets of SEs
lsmeans(model, pairwise~cond)[1]
$lsmeans
 cond   lsmean       SE df lower.CL upper.CL
 a    176.0988 34.77988 39 105.7498 246.4477
 b    234.3331 15.55403 39 202.8721 265.7941
 c    261.0583 34.77988 39 190.7094 331.4073

Confidence level used: 0.95
with(total.data,tapply(total, cond, st.err))
Best Answer
The difference arises because you fitted a model that pools the SDs into one common value. To show this more precisely, first let's get some stats:
Note that the sample SDs are quite different from one another, which is part of the reason for your confusion.
Now, the standard errors you calculated are based on the individual SDs:
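The per-group statistics and the individual standard errors described above can be sketched in R as follows. This is a minimal reconstruction that assumes the data from the question; the names m and se.individual are labels chosen here for illustration.

```r
# Data copied from the question: groups a, b, c with 6, 30 and 6 observations
total.data <- data.frame(
  total = c(
    180.3946, 184.5053, 174.7285, 176.7839, 168.2292, 171.951,
    183.4105, 186.4333, 178.9715, 246.7047, 231.7752, 169.827,
    152.21, 179.58, 133.12, 115.18, 195.45, 102.07,
    198.0954, 242.6217, 283.9676, 388.9224, 236.2608, 210.8172,
    367.2511, 374.014, 366.124, 367.2511, 465.7633, 396.5568,
    173.8551, 101.9857, 156.1761, 171.3417, 248.2407, 206.0161,
    291.6284, 280.7974, 212.986, 271.6146, 276.5592, 232.7643
  ),
  cond = factor(rep(c("a", "b", "c"), times = c(6, 30, 6)))
)

n <- with(total.data, tapply(total, cond, length))  # group sizes
m <- with(total.data, tapply(total, cond, mean))    # group means
s <- with(total.data, tapply(total, cond, sd))      # group SDs
cbind(n, m, s)

# Standard errors based on each group's own SD
# (the same quantity st.err() computes in the question)
se.individual <- s / sqrt(n)
se.individual
```

The group means in m should agree with the lsmean column in the question's output, since for a one-way model the least-squares means are just the group means.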
But the ones that lsmeans outputs are based on one SD (the pooled SD, which is obtained from a weighted average of the variances):
The latter results match those from the lsmeans output. Note that the largest SD (s) corresponds to the largest sample size (n), giving it the most weight in calculating the pooled SD (s.p). This makes the pooled SD quite a bit larger than the sample SDs for conditions "a" and "c".
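The pooled calculation can be sketched as follows. This is a hedged reconstruction, assuming the usual pooled-variance formula in which each group variance is weighted by its degrees of freedom (n - 1); se.pooled is an illustrative name. The last lines cross-check the result against the residual standard error that lm() reports for the same one-way model.

```r
# Data copied from the question: groups a, b, c with 6, 30 and 6 observations
total.data <- data.frame(
  total = c(
    180.3946, 184.5053, 174.7285, 176.7839, 168.2292, 171.951,
    183.4105, 186.4333, 178.9715, 246.7047, 231.7752, 169.827,
    152.21, 179.58, 133.12, 115.18, 195.45, 102.07,
    198.0954, 242.6217, 283.9676, 388.9224, 236.2608, 210.8172,
    367.2511, 374.014, 366.124, 367.2511, 465.7633, 396.5568,
    173.8551, 101.9857, 156.1761, 171.3417, 248.2407, 206.0161,
    291.6284, 280.7974, 212.986, 271.6146, 276.5592, 232.7643
  ),
  cond = factor(rep(c("a", "b", "c"), times = c(6, 30, 6)))
)

n <- with(total.data, tapply(total, cond, length))  # group sizes
s <- with(total.data, tapply(total, cond, sd))      # group SDs

# Pooled SD: square root of the df-weighted average of the group variances
s.p <- sqrt(sum((n - 1) * s^2) / sum(n - 1))

# Standard errors based on the single pooled SD; groups with equal n
# therefore get identical SEs, as in the lsmeans table
se.pooled <- s.p / sqrt(n)
se.pooled

# Cross-check: the residual standard error of the one-way model
# is the same pooled SD
model <- lm(total ~ cond, data = total.data)
all.equal(s.p, summary(model)$sigma)
```

If the goal is a plot with error bars that reflect each group's own variability, the individual SEs are the ones to draw; the pooled SEs are appropriate only if the equal-variance assumption of the model is reasonable.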