When working with grouped (factor) data in R, the lm() function can be used to calculate the mean for each group, and it also reports standard errors for the estimated means. However, these standard errors differ from what I get when I calculate them by hand.
Here is an example (taken from Predicting the difference between two groups in R).
First, calculate the means with lm():
mtcars$cyl <- factor(mtcars$cyl)
mylm <- lm(mpg ~ cyl, data = mtcars)
summary(mylm)$coef
Estimate Std. Error t value Pr(>|t|)
(Intercept) 26.663636 0.9718008 27.437347 2.688358e-22
cyl6 -6.920779 1.5583482 -4.441099 1.194696e-04
cyl8 -11.563636 1.2986235 -8.904534 8.568209e-10
The intercept is the mean for the first group, the 4-cylinder cars.
To get the means by direct calculation, I use this:
with(mtcars, tapply(mpg, cyl, mean))
4 6 8
26.66364 19.74286 15.10000
To get the standard errors of the means, I calculate each group's sample standard deviation and divide it by the square root of the number of observations in that group:
with(mtcars, tapply(mpg, cyl, sd)/sqrt(summary(mtcars$cyl)) )
4 6 8
1.3597642 0.5493967 0.6842016
The direct calculation gives the same means, but the standard errors differ between the two approaches; I had expected them to be the same.
What is going on here? Is it related to lm() fitting a mean for each group plus a common error term?
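One way to probe this hypothesis (just a sketch, reusing the model above) is to divide lm()'s single residual standard error by the square root of each group's size:

```r
# Sketch: if lm() assumes one common error term, its standard errors
# should equal the residual standard error divided by sqrt(n_k).
mtcars$cyl <- factor(mtcars$cyl)
mylm <- lm(mpg ~ cyl, data = mtcars)

n <- table(mtcars$cyl)   # group sizes: 11, 7, 14
sigma(mylm) / sqrt(n)    # one pooled sd, scaled by each group's size
# The first value (about 0.972) matches the (Intercept) standard error,
# but none of them match the per-group values from tapply() above.
```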
Edit:
After Sven's answer (below) I can formulate my question more concisely and clearly.
For categorical data, we can calculate the means of a variable for the different groups by using lm() without an intercept:
mtcars$cyl <- factor(mtcars$cyl)
mylm <- lm(mpg ~ cyl - 1, data = mtcars)
summary(mylm)$coef
Estimate Std. Error
cyl4 26.66364 0.9718008
cyl6 19.74286 1.2182168
cyl8 15.10000 0.8614094
We can compare this with a direct calculation of the means and their standard errors:
with(mtcars, tapply(mpg, cyl, mean))
4 6 8
26.66364 19.74286 15.10000
with(mtcars, tapply(mpg, cyl, sd)/sqrt(summary(mtcars$cyl)) )
4 6 8
1.3597642 0.5493967 0.6842016
The means are exactly the same, but the standard errors differ between these two methods (as Sven also notes). My question is: why are they different and not the same?
(When editing my question, should I delete the original text or append my edit, as I did?)
Best Answer
The standard errors differ because the regression computes a single pooled estimate of the variance, common to all groups, while the direct calculation computes a separate variance estimate for each group.
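To make this concrete, here is a sketch that reproduces both sets of standard errors from the same data, using the no-intercept model from the question:

```r
# Pooled vs. separate variance estimates for the group means.
mtcars$cyl <- factor(mtcars$cyl)
mylm <- lm(mpg ~ cyl - 1, data = mtcars)

n <- tapply(mtcars$mpg, mtcars$cyl, length)   # group sizes: 11, 7, 14

# Regression: one pooled residual standard error, scaled by sqrt(n_k)
pooled_se <- sigma(mylm) / sqrt(n)

# Direct calculation: each group's own sd, scaled by sqrt(n_k)
separate_se <- tapply(mtcars$mpg, mtcars$cyl, sd) / sqrt(n)

all.equal(unname(pooled_se),
          unname(summary(mylm)$coef[, "Std. Error"]))
# pooled_se   reproduces lm()'s standard errors (0.972, 1.218, 0.861)
# separate_se reproduces the hand calculation   (1.360, 0.549, 0.684)
```

The pooled estimate borrows strength across groups: each group's standard error uses the residual variation from all 32 cars, whereas the separate estimates use only that group's own spread.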