Solved – Confidence intervals for group means (R)

confidence intervalmeanr

Here are some sample data in R:

set.seed(42)
df <- data.frame(g = factor(rep(1:2, each= 50)), y = rnorm(100)+rep(0:1, each=50))

One can easily get group means using e.g. with(df, tapply(y,g,mean)) but there is no such easy way to get the confidence intervals for group means. This is why I tried:

lm(y ~ g-1, df)
# correct group means:
# -0.03567   1.10070 
confint(lm(y~g-1, df))
# too narrow CI's
#         2.5 %    97.5 %
# g1 -0.3287751 0.2574315
# g2  0.8075981 1.3938047

That is, one can estimate the group means using a linear model with dummy group indicators, omitting the overall intercept. But the confidence intervals of these regression parameters are narrower than the confidence intervals of group means. The latter could be found group by group with the same function:

confint(lm(y~1, data=df, subset=g==1))
#                  2.5 %    97.5 %
# (Intercept) -0.3629177 0.2915742

Or a manual check using textbook formulae:

ci.mean <- function(x, alfa=0.05){
   n <- length(x)
   a<-qt(1-alfa/2, n-1)
   m<-mean(x);          s<-sd(x)
   se<-s/sqrt(n)
   res <- c(m, m-a*se,m+a*se)
   setNames(res, c("mean", paste(100*alfa/2, "%"), paste(100*(1-alfa/2), "%")))
}
ci.mean(with(df, y[g==1])
#        mean       2.5 %      97.5 % 
# -0.03567178 -0.36291775  0.29157418 

There is probably an easy answer to the question of why the CIs of seemingly the same parameters are different. (The answer, obviously, has to start with difference in standard errors.) But I would be interested in the interpretation: why is that I can trust the group means found with lm(y~g-1) but I can't trust the confidence intervals around those "means" found with confint(lm(y~g-1))? And another naive question, why is the standard error for a group mean smaller if another group is present? That is:

coef(summary(lm(y~g-1, df)))[1,2]
# [1] 0.1476987
coef(summary(lm(y~1, df, subset=g==1)))[1,2]
# [1] 0.1628433

Again, I am more interested in the substantial interpretation than the formula showing why this is so. (I suppose after some sleep I could figure out the formula but would still be in trouble with interpretation).

Thanks in advance!

Best Answer

The answer to your "naive question" contains the solution to your problem.

In the linear model on all the data, the residual variance is estimated from all 100 data points, based on the difference of each value from its associated group mean. Thus you will note that the difference between the top and bottom CI is the same for both groups (0.5862 for your data set).

In the subset analysis, only the data points in the selected group are used, so one would expect a different CI range than in the pooled analysis. In this case the CI range is larger, 0.6545.

If you had analyzed the group 2 subset separately instead, its individual CI would have been less than its CI in the pooled analysis:

confint(lm(y~1, data=df, subset=g==2))
                 2.5 %   97.5 %
 (Intercept) 0.8378242 1.363579

The CI range here is only 0.5258.

One group analyzed individually has a narrower CI band than in pooled analysis, one has a wider band when analyzed individually. The pooling of variance estimates in the combined linear model explains your results.