When working with grouped (factor) data in R, the lm() function can be used to calculate the mean for each group, and it also reports standard errors for the estimated means. However, these standard errors differ from what I get when I calculate them by hand.
Here is an example (taken from Predicting the difference between two groups in R).
First, calculate the means with lm():
mtcars$cyl <- factor(mtcars$cyl)
mylm <- lm(mpg ~ cyl, data = mtcars)
summary(mylm)$coef
Estimate Std. Error t value Pr(>|t|)
(Intercept) 26.663636 0.9718008 27.437347 2.688358e-22
cyl6 -6.920779 1.5583482 -4.441099 1.194696e-04
cyl8 -11.563636 1.2986235 -8.904534 8.568209e-10
The intercept is the mean for the first group, the 4-cylinder cars.
To get the means by direct calculation, I use this:
with(mtcars, tapply(mpg, cyl, mean))
4 6 8
26.66364 19.74286 15.10000
To get the standard errors of the means, I calculate each group's sample standard deviation and divide it by the square root of the number of observations in that group:
with(mtcars, tapply(mpg, cyl, sd)/sqrt(summary(mtcars$cyl)) )
4 6 8
1.3597642 0.5493967 0.6842016
The direct calculation gives the same means, but the standard errors differ between the two approaches; I had expected them to be the same.
What is going on here? Is it related to lm() fitting a mean for each group plus a common error term?
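One way to probe this hypothesis (just a sketch, reusing the model above) is to divide lm()'s single residual standard error by the square root of each group's size:

```r
# Sketch: if lm() assumes one common error term, its standard errors
# should equal the residual standard error divided by sqrt(n_k).
mtcars$cyl <- factor(mtcars$cyl)
mylm <- lm(mpg ~ cyl, data = mtcars)

n <- table(mtcars$cyl)   # group sizes: 11, 7, 14
sigma(mylm) / sqrt(n)    # one pooled sd, scaled by each group's size
# The first value (about 0.972) matches the (Intercept) standard error,
# but none of them match the per-group values from tapply() above.
```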
Edit:
After Sven's answer (below) I can formulate my question more concisely and clearly.
For categorical data, we can calculate the means of a variable for the different groups by using lm() without an intercept:
mtcars$cyl <- factor(mtcars$cyl)
mylm <- lm(mpg ~ cyl - 1, data = mtcars)
summary(mylm)$coef
Estimate Std. Error
cyl4 26.66364 0.9718008
cyl6 19.74286 1.2182168
cyl8 15.10000 0.8614094
We can compare this with a direct calculation of the means and their standard errors:
with(mtcars, tapply(mpg, cyl, mean))
4 6 8
26.66364 19.74286 15.10000
with(mtcars, tapply(mpg, cyl, sd)/sqrt(summary(mtcars$cyl)) )
4 6 8
1.3597642 0.5493967 0.6842016
The means are exactly the same, but the standard errors differ between these two methods (as Sven also notes). My question is: why are they different and not the same?
(When editing my question, should I delete the original text or append my edit, as I did?)
Best Answer
The standard errors differ because the regression computes a single pooled estimate of the variance, common to all groups, while the direct calculation computes a separate variance estimate for each group.
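To make this concrete, here is a sketch that reproduces both sets of standard errors from the same data, using the no-intercept model from the question:

```r
# Pooled vs. separate variance estimates for the group means.
mtcars$cyl <- factor(mtcars$cyl)
mylm <- lm(mpg ~ cyl - 1, data = mtcars)

n <- tapply(mtcars$mpg, mtcars$cyl, length)   # group sizes: 11, 7, 14

# Regression: one pooled residual standard error, scaled by sqrt(n_k)
pooled_se <- sigma(mylm) / sqrt(n)

# Direct calculation: each group's own sd, scaled by sqrt(n_k)
separate_se <- tapply(mtcars$mpg, mtcars$cyl, sd) / sqrt(n)

all.equal(unname(pooled_se),
          unname(summary(mylm)$coef[, "Std. Error"]))
# pooled_se   reproduces lm()'s standard errors (0.972, 1.218, 0.861)
# separate_se reproduces the hand calculation   (1.360, 0.549, 0.684)
```

The pooled estimate borrows strength across groups: each group's standard error uses the residual variation from all 32 cars, whereas the separate estimates use only that group's own spread.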