Solved – Different confidence intervals from direct calculation and R’s confint function

confidence-interval, multiple-comparisons, r

I am interested in pairwise comparisons of means, through the calculation of confidence intervals (CIs) of the difference between two given samples' means. However, CIs calculated directly from each sample differ from those obtained through the confint(lm(...)) function. This leads to different conclusions regarding, first, whether a given mean is significantly different from zero, and second, whether two given means are significantly different from each other…

Some possible R code to observe this:

# sample preparation...
n <- 4
trt <- rep(c("a","b","c","d","e"),each=n)
X <- c()
for(i in 1:length(unique(trt))){
    X <- c(X,rnorm(n,i,1))
}
# plot(X~factor(trt))

# calculation of 95% CIs for each level of trt
trt_sd <- c()
trt_mean <- c()
for(i in 1:length(unique(trt))){
    Xi <- X[which(trt==sort(unique(trt))[i])]
    trt_sd <- c(trt_sd,sd(Xi))
    trt_mean <- c(trt_mean,mean(Xi))
}
trt_ci95 <- qt(.975,(n-1))*trt_sd/sqrt(n)

CIs_direct <- data.frame(trt=sort(unique(trt)), low2.5=(trt_mean-trt_ci95), high97.5=(trt_mean+trt_ci95))

# calculation of 95% CIs for each level through the confint() function
CIs_fromLM <- confint(lm(X~0+trt))

Which CIs should I use for pairwise comparisons: those from the direct calculation, or those given by the confint function?

Edit: I have the impression that agreement between the two procedures improves as $n$ increases. However, I am working with a small $n$, so the question remains.

Thanks in advance.

Best Answer

The two sets of confidence intervals are based on different assumptions about the data and therefore should not agree.

Analysis

Buried in the R code are two distinct models for $X,Y$ data where $X$ is a dummy for a "treatment" (variable trt with values "a" through "e") and $Y$ is a response (variable X in the code).
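As a quick illustration of that coding (not part of the original code), the design matrix of the regression has one indicator column per treatment, which is why each coefficient is just a group mean:

head(model.matrix(lm(X ~ 0 + trt)))   # one 0/1 indicator column per treatment
coef(lm(X ~ 0 + trt))                 # the five group means (equal to trt_mean)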

  1. The loop splits the dataset into groups of $Y$-values for each unique instance of $X$ and computes a confidence interval for the underlying mean of each group, yielding one interval per group. Implicitly the model is one where each group will be analyzed as if it were a set of independent samples from a common underlying Normal distribution. We could write

    $$Y \sim \mathcal{N}(\mu_\text{treatment}, \sigma^2_\text{treatment}).$$

  2. The Ordinary Least Squares regression performed by lm supposes that the response $Y$ is in the form

    $$Y \sim \mathcal{N}(\mu_\text{treatment}, \sigma^2).$$

The second differs from the first in how it models the variation: by including all the data in a single regression model, lm assumes all the variances are the same. By splitting the data into separate models, the loop imposes no conditions on the variances.
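To see the first model in action, note that each "direct" interval coincides with the one-sample t interval computed from that group alone; a quick check with base R's t.test:

t.test(X[trt == "a"])$conf.int   # one-sample 95% t interval for group "a"
CIs_direct[1, ]                  # the "direct" interval: same limits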


Recommendations

If you have reasons to suppose the true underlying variation around the mean in each group should be the same, then you would want to "borrow strength" by using the full regression model because it provides a more precise estimate of that common variance. If you do not have such reasons, you might (conservatively) want to estimate the variance separately in each group. The price you pay is a set of different confidence intervals--some of which may be shorter, but most of which will be longer.

Ordinarily, one would proceed as follows:

  1. Conduct the regression, because it is a simpler and potentially more powerful model.

  2. Check whether the residuals in each treatment group are consistent with the assumption of a constant dispersion, independent of treatment.

    • If the check is good, use Analysis of Variance (ANOVA) techniques to conduct tests of the means. (Do not use separate sets of t-tests. They will be interdependent, making it difficult or impossible to interpret the resulting collection of p-values.)

    • If the check suggests varying dispersion ("heteroscedasticity"), then adopt the more flexible model used by the "direct" calculation. Tests of the means will be (much) more complicated, involving correlated statistics that do not exactly have Student t distributions; a Welch-type option is sketched just after this list.

      (There are alternatives to treating each group separately. In many cases one might expect the variance of the treatment group to be some simple, definite function of the treatment group mean. An exposition of this possibility would take us further afield. Good resources include texts on Exploratory Data Analysis, which are concerned (among other things) with identifying and dealing with heteroscedasticity.)
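For the heteroscedastic branch above, one reasonable option (a sketch, not the only approach) is base R's Welch-type machinery, which drops the equal-variance assumption:

oneway.test(X ~ trt)                       # Welch one-way test (var.equal = FALSE by default)
pairwise.t.test(X, trt, pool.sd = FALSE,   # separate-variance (Welch) t-tests
                p.adjust.method = "holm")  # with a multiplicity adjustment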

Various diagnostics exist to check heteroscedasticity. A version of Levene's Test for Equality of Variances is often a good choice, because it is robust with respect to departures from Normality of the residuals (which are hard to check with such small group sizes).
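A minimal base-R sketch of the Brown-Forsythe variant of Levene's test (a one-way ANOVA on absolute deviations from group medians; car::leveneTest computes the same thing if that package is available):

absdev <- abs(X - ave(X, trt, FUN = median))  # distance from each group's median
anova(lm(absdev ~ trt))                       # one-way ANOVA on those deviations

Bear in mind that with only four observations per group this test has little power, so a non-significant result is weak evidence for equal variances.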


Details

Both models will estimate the same set of means $\{\mu_\text{treatment}\,|\, \text{treatment}\in\{a,b,c,d,e\}\}$; their estimates are the sample means. Both models will therefore obtain the same set of residuals (found by subtracting the appropriate group mean from each observation, depending on which treatment it is associated with). Consequently we should be able to relate their estimates of the variances. In the loop, each variance estimate is a group's sum of squared residuals, divided by a constant (equal to the number of observations in that group, minus one). In the regression, the variance estimate is the sum of squares of all residuals, divided by a constant (equal to the total number of observations, minus the number of treatments).

Let's examine these quantities by undoing the divisions so we can get at the sums of squares:

fit <- lm(X~0+trt)             # Store the regression results
sum(resid(fit)^2)              # Display the sum of squares of residuals (SSR)
cat((table(trt)-1) * trt_sd^2) # Display the SSR for each treatment in the loop

The output is

[1] 7.997896
3.488526 0.2166572 1.280409 2.758286 0.2540169

The second line of values does add up to the first.
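A one-line check confirms the accounting:

sum((table(trt)-1) * trt_sd^2)   # 7.997896, reproducing sum(resid(fit)^2)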

When computing a confidence limit, a multiple of the estimated standard deviation will be added to the mean. The multiple consists of two factors: a quantile of a Student t distribution, divided by the square root of a count. This is the formula used to compute trt_ci95, the "direct" CI. The division by the root count converts an estimated standard deviation into a standard error. This will be the same factor in either model. However, the degrees of freedom parameter in the regression will be larger (15 instead of 3). This will cause the multiple for the regression to be smaller than that used for the "direct" calculation. This difference does indeed become negligible as the group sizes grow larger, as suggested in the question.
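The two multipliers can be displayed side by side (values specific to this design, with five groups of four):

qt(.975, df = 3)    # 3.18, used by the "direct" intervals
qt(.975, df = 15)   # 2.13, used by the regression intervals
qnorm(.975)         # 1.96, the common limit as group sizes grow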

Finally, here is a way to obtain the lm confidence intervals from the outputs of the loop. It sums the squared residuals to estimate the common standard deviation and uses the group counts (n) to convert that SD to the standard error of each mean:

df <- length(trt) - length(trt_mean)     # Degrees of freedom (15)
ci_manual <- outer(trt_mean, qt(c(0.025, .975), df) *
           sqrt(sum((table(trt)-1) * trt_sd^2) / df) / sqrt(n), "+")
ci_manual

This gives exactly the same intervals as CIs_fromLM computed from the lm output.
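One can confirm the match programmatically (a quick sanity check, not part of the original calculation):

all.equal(as.vector(CIs_fromLM), as.vector(ci_manual))   # TRUE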
