Solved – R: Calculating mean and standard error of mean for factors with lm() vs. direct calculation -edited

categorical data, lm, mean, r

When the data contain factors, R's lm() function can be used to calculate the mean of each group. It also gives standard errors for the estimated means, but these standard errors differ from what I get when I calculate them by hand.

Here is an example (taken from the question "Predicting the difference between two groups in R"):

First calculate the mean with lm():

    mtcars$cyl <- factor(mtcars$cyl)
    mylm <- lm(mpg ~ cyl, data = mtcars)
    summary(mylm)$coef

                Estimate Std. Error   t value     Pr(>|t|)
  (Intercept)  26.663636  0.9718008 27.437347 2.688358e-22
  cyl6         -6.920779  1.5583482 -4.441099 1.194696e-04
  cyl8        -11.563636  1.2986235 -8.904534 8.568209e-10

The intercept is the mean for the first group, the 4-cylinder cars.
To get the means by direct calculation, I use this:

    with(mtcars, tapply(mpg, cyl, mean))

         4        6        8 
    26.66364 19.74286 15.10000 
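
As a quick cross-check, the same group means can be rebuilt from the lm() coefficients shown earlier: the intercept is the 4-cylinder mean, and each remaining coefficient is an offset from it. This is a minimal sketch that assumes R's default treatment contrasts; `cars`, `fit`, `means_from_lm`, and `direct_means` are illustrative names, not part of the original question.

```r
# Work on a local copy so the example is self-contained.
cars <- mtcars
cars$cyl <- factor(cars$cyl)
fit <- lm(mpg ~ cyl, data = cars)

# Under the default treatment contrasts, the intercept is the mean of the
# reference level (4 cylinders); cyl6 and cyl8 are differences from it.
b <- coef(fit)
means_from_lm <- c("4" = b[["(Intercept)"]],
                   "6" = b[["(Intercept)"]] + b[["cyl6"]],
                   "8" = b[["(Intercept)"]] + b[["cyl8"]])

# Group means computed directly from the data.
direct_means <- with(cars, tapply(mpg, cyl, mean))

# Compare the two sets of means up to floating-point tolerance.
all.equal(unname(means_from_lm), as.vector(direct_means))
```

Note that in this parameterisation only the intercept's standard error belongs to a group mean; the standard errors of cyl6 and cyl8 are for differences between means.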

To get the standard errors of the means, I calculate the sample standard deviation of each group and divide it by the square root of the number of observations in that group:

    with(mtcars, tapply(mpg, cyl, sd) / sqrt(summary(mtcars$cyl)))

         4         6         8 
   1.3597642 0.5493967 0.6842016 

The direct calculation gives the same mean, but the standard errors differ between the two approaches; I had expected them to be the same.
What is going on here? Is it related to lm() fitting the mean for each group plus an error term?

Edited:
After Sven's answer (below), I can formulate my question more concisely and clearly.

For categorical data, we can calculate the group means of a variable by using lm() without an intercept:

  mtcars$cyl <- factor(mtcars$cyl)
  # "- 1" drops the intercept, so each coefficient is a group mean
  mylm <- lm(mpg ~ cyl - 1, data = mtcars)
  summary(mylm)$coef[, 1:2]

       Estimate Std. Error
  cyl4 26.66364  0.9718008
  cyl6 19.74286  1.2182168
  cyl8 15.10000  0.8614094

We can compare this with a direct calculation of the means and their standard errors:

  with(mtcars, tapply(mpg, cyl, mean))

         4        6        8 
    26.66364 19.74286 15.10000 

  with(mtcars, tapply(mpg, cyl, sd)/sqrt(summary(mtcars$cyl)) )

         4         6         8 
   1.3597642 0.5493967 0.6842016 

The means are exactly the same, but the standard errors differ between the two methods (as Sven also notes). My question is: why are they different?

(When editing my question, should I delete the original text or add my edit as I did?)

Best Answer

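The standard errors differ because the regression computes a single pooled estimate of the variance across all groups, while the direct calculation computes a separate variance estimate for each group. lm() assumes a common error variance, estimates it from the residuals of all observations, and divides that one residual standard deviation by the square root of each group's size; the direct calculation divides each group's own standard deviation by the square root of that group's size.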
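
To make this concrete, here is a minimal sketch that reproduces both sets of standard errors by hand. It assumes the no-intercept model from the question; the names `cars`, `sigma_pooled`, `se_pooled`, `se_separate`, and `se_lm` are illustrative.

```r
cars <- mtcars
cars$cyl <- factor(cars$cyl)

n <- as.vector(table(cars$cyl))                 # group sizes
s <- as.vector(with(cars, tapply(mpg, cyl, sd)))  # per-group standard deviations

# Pooled standard deviation: within-group sums of squares divided by the
# residual degrees of freedom (total observations minus number of groups).
sigma_pooled <- sqrt(sum((n - 1) * s^2) / (sum(n) - length(n)))

se_pooled   <- sigma_pooled / sqrt(n)   # one common SD, scaled by group size
se_separate <- s / sqrt(n)              # each group's own SD, scaled by group size

# Standard errors reported by lm() for the no-intercept model.
fit   <- lm(mpg ~ cyl - 1, data = cars)
se_lm <- summary(fit)$coefficients[, "Std. Error"]

all.equal(se_pooled, unname(se_lm))           # pooled version matches lm()
all.equal(sigma_pooled, summary(fit)$sigma)   # same residual standard deviation
se_separate                                   # matches the tapply() calculation
```

The 6-cylinder group shows the effect most clearly: its separate standard error (0.549) is well below the pooled one (1.218), because its own spread is smaller than the spread pooled over all three groups. Which figure is more appropriate depends on whether the equal-variance assumption behind lm() is reasonable for the data.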