Solved – Confidence intervals for group means (R)

confidence intervalmeanr

Here are some sample data in R:

set.seed(42)
df <- data.frame(g = factor(rep(1:2, each= 50)), y = rnorm(100)+rep(0:1, each=50))

One can easily get group means using e.g. with(df, tapply(y,g,mean)) but there is no such easy way to get the confidence intervals for group means. This is why I tried:

lm(y ~ g-1, df)
# correct group means:
# -0.03567   1.10070 
confint(lm(y~g-1, df))
# too narrow CI's
#         2.5 %    97.5 %
# g1 -0.3287751 0.2574315
# g2  0.8075981 1.3938047

That is, one can estimate the group means using a linear model with dummy group indicators, omitting the overall intercept. But the confidence intervals of these regression parameters are narrower than the confidence intervals of group means. The latter could be found group by group with the same function:

confint(lm(y~1, data=df, subset=g==1))
#                  2.5 %    97.5 %
# (Intercept) -0.3629177 0.2915742

Or a manual check using textbook formulae:

ci.mean <- function(x, alfa=0.05){
   n <- length(x)
   a<-qt(1-alfa/2, n-1)
   m<-mean(x);          s<-sd(x)
   se<-s/sqrt(n)
   res <- c(m, m-a*se,m+a*se)
   setNames(res, c("mean", paste(100*alfa/2, "%"), paste(100*(1-alfa/2), "%")))
}
ci.mean(with(df, y[g==1])
#        mean       2.5 %      97.5 % 
# -0.03567178 -0.36291775  0.29157418

There is probably an easy answer to the question of why the CIs of seemingly the same parameters are different. (The answer, obviously, has to start with difference in standard errors.) But I would be interested in the interpretation: why is that I can trust the group means found with lm(y~g-1) but I can't trust the confidence intervals around those "means" found with confint(lm(y~g-1))? And another naive question, why is the standard error for a group mean smaller if another group is present? That is:

coef(summary(lm(y~g-1, df)))[1,2]
# [1] 0.1476987
coef(summary(lm(y~1, df, subset=g==1)))[1,2]
# [1] 0.1628433

Again, I am more interested in the substantial interpretation than the formula showing why this is so. (I suppose after some sleep I could figure out the formula but would still be in trouble with interpretation).

Thanks in advance!

Best Answer

The answer to your "naive question" contains the solution to your problem.

In the linear model on all the data, the residual variance is estimated from all 100 data points, based on the difference of each value from its associated group mean. Thus you will note that the difference between the top and bottom CI is the same for both groups (0.5862 for your data set).

In the subset analysis, only the data points in the selected group are used, so one would expect a different CI range than in the pooled analysis. In this case the CI range is larger, 0.6545.

If you had analyzed the group 2 subset separately instead, its individual CI would have been less than its CI in the pooled analysis:

confint(lm(y~1, data=df, subset=g==2))
                 2.5 %   97.5 %
 (Intercept) 0.8378242 1.363579

The CI range here is only 0.5258.

One group analyzed individually has a narrower CI band than in pooled analysis, one has a wider band when analyzed individually. The pooling of variance estimates in the combined linear model explains your results.

Related Solutions

R – How to Make Inferences About Group Means Using Credible Intervals vs Confidence Intervals

In the answers there (if I understood correctly) I learned that within-subject variance does not effect inferences made about group means and it is ok to simply take the averages of averages to calculate group mean, then calculate within-group variance and use that to perform significance tests.

Let me develop this idea here. The model for the individual observations is $$y_{ijk}= \mu_i + \alpha_{ij} + \epsilon_{ijk}$$, where :

$y_{ijk}$ is the $k$-th measurement of individual $j$ of group $i$
$\alpha_{ij} \sim_{\text{iid}} {\cal N}(0, \sigma^2_b)$ is the random effect for individual $j$ of group $i$
$\epsilon_{ijk} \sim_{\text{iid}} {\cal N}(0, \sigma^2_w)$ is the within-error

In my answer to your first question, I have suggested you to note that one obtains a classical (fixed effects) Gaussian linear model for the subjects means $\bar y_{ij\bullet}$. Indeed you can easily check that $$\bar y_{ij\bullet} = \mu_i + \delta_{ij}$$ with $$\delta_{ij} = \alpha_{ij} + \frac{1}{K}\sum_k \epsilon_{ijk} \sim_{\text{iid}} {\cal N}(0, \sigma^2) \quad \text{where } \quad \boxed{\sigma^2=\sigma^2_b+\frac{\sigma^2_w}{K}},$$ assuming $K$ repeated measurements for each individual. This is nothing but the one-way ANOVA model with a fixed factor.

And then I claimed that in order to draw inference about the $\mu_i$ you can simply consider the simple classical linear model whose observations are the subjects means $\bar y_{ij\bullet}$. Update 12/04/2014: Some examples of this idea are now written on my blog: Reducing a model to get confidence intervals. I'm under the impression that this always work when we average the data over the levels of a random effect.

In the answers there (if I understood correctly) I learned that within-subject variance does not effect inferences made about group means and it is ok to simply take the averages of averages to calculate group mean, then calculate within-group variance and use that to perform significance tests. I would like to use a method where the larger the within subject variance the less sure I am about the group means or understand why it does not make sense to desire that.

As you see from the boxed formula, the within-variance $\sigma^2_w$ plays a role in the model for the observed group means.

Solved – Why is the CI in the median larger then the CI in the mean

The mean is an ordinary least squares estimator, and so it's BLUE - it's the best linear unbiased estimator. Best, in this case means that it has the smallest sampling variance of any unbiased estimator (including the median).

The proof is on the Wikipedia page for Gauss-Markov theorem: http://en.wikipedia.org/wiki/Gauss%E2%80%93Markov_theorem

Another way to think about it is that the mean uses all of the information available in the sample, the median is just one value.

Another way to look at it is to run a simulation, calculating the median and mean for a large number of samples, and to look at their standard deviation:

> sd(tapply(runif(100000), rep(1:1000, 100), mean))
[1] 0.02935391
> sd(tapply(runif(100000), rep(1:1000, 100), median))
[1] 0.04848859

The mean has a smaller variance (sd), which means it has narrower confidence intervals.