Confidence Intervals – Are 50% Confidence Intervals More Robustly Estimated Than 95%?

Tags: assumptions, confidence-interval, robust

My question flows out of a comment on one of Andrew Gelman's blog posts, in which Gelman advocates the use of 50% confidence intervals instead of 95% confidence intervals, although not on the grounds that they are more robustly estimated:

I prefer 50% to 95% intervals for 3 reasons:

  1. Computational stability,

  2. More intuitive evaluation (half the 50% intervals should contain the true value),

  3. A sense that in applications it’s best to get a sense of where the parameters and predicted values will be, not to attempt an unrealistic near-certainty.

The commenter's idea seems to be that problems with the assumptions underlying the construction of the confidence interval will have more of an impact on a 95% CI than on a 50% CI. However, he doesn't really explain why.

[…] as you go to larger intervals, you become more sensitive in general to details or assumptions of your model. For example, you would never believe that you had correctly identified the 99.9995% interval. Or at least that’s my intuition. If it’s right, it argues that 50-percent should be better estimated than 95-percent. Or maybe “more robustly” estimated, since it is less sensitive to assumptions about the noise, perhaps?

Is it true? Why/why not?

Best Answer

This answer analyzes the meaning of the quotation and offers the results of a simulation study to illustrate it and help understand what it might be trying to say. The study can easily be extended by anybody (with rudimentary R skills) to explore other confidence interval procedures and other models.

Two interesting issues emerged in this work. One concerns how to evaluate the accuracy of a confidence interval procedure. The impression one gets of robustness depends on that. I display two different accuracy measures so you can compare them.

The other issue is that although a confidence interval procedure with low confidence may be robust, the corresponding confidence limits might not be robust at all. Intervals tend to work well because the errors they make at one end often counterbalance the errors they make at the other. As a practical matter, you can be pretty sure that around half of your $50\%$ confidence intervals are covering their parameters, but the actual parameter might consistently lie near one particular end of each interval, depending on how reality departs from your model assumptions.


Robust has a standard meaning in statistics:

Robustness generally implies insensitivity to departures from assumptions surrounding an underlying probabilistic model.

(Hoaglin, Mosteller, and Tukey, Understanding Robust and Exploratory Data Analysis. J. Wiley (1983), p. 2.)

This is consistent with the quotation in the question. To understand the quotation we still need to know the intended purpose of a confidence interval. To this end, let's review what Gelman wrote.

I prefer 50% to 95% intervals for 3 reasons:

  1. Computational stability,

  2. More intuitive evaluation (half the 50% intervals should contain the true value),

  3. A sense that in applications it’s best to get a sense of where the parameters and predicted values will be, not to attempt an unrealistic near-certainty.

Since getting a sense of predicted values is not what confidence intervals (CIs) are intended for, I will focus on getting a sense of parameter values, which is what CIs do. Let's call these the "target" values. Whence, by definition, a CI is intended to cover its target with a specified probability (its confidence level). Achieving intended coverage rates is the minimum criterion for evaluating the quality of any CI procedure. (Additionally, we might be interested in typical CI widths. To keep the post to a reasonable length, I will ignore this issue.)

These considerations invite us to study how much a confidence interval calculation could mislead us concerning the target parameter value. The quotation could be read as suggesting that lower-confidence CIs might retain their coverage even when the data are generated by a process that differs from the assumed model. That's something we can test. The procedure is:

  • Adopt a probability model that includes at least one parameter. The classic one is sampling from a Normal distribution of unknown mean and variance.

  • Select a CI procedure for one or more of the model's parameters. An excellent one constructs the CI from the sample mean and sample standard deviation, multiplying the latter by a factor given by a Student t distribution. (A minimal sketch of this construction appears just after this list.)

  • Apply that procedure to various different models--departing not too much from the adopted one--to assess its coverage over a range of confidence levels.
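
As promised, here is a minimal sketch of the second bullet applied to the first: the t-based CI for the mean of a single sample drawn from the adopted Normal model. The helper t.ci, the seed, and the sample size of 12 are just for this illustration; the full simulation code appears at the end of the post.

#
# Minimal sketch: Student-t CI for the mean of one simulated sample.
# (Illustration only; not the code that produced the plots.)
#
set.seed(1)
x <- rnorm(12)                                  # one sample from the adopted model
t.ci <- function(x, level) {
  q <- qt((1 + level) / 2, df = length(x) - 1)  # two-sided t multiplier
  mean(x) + c(-1, 1) * q * sd(x) / sqrt(length(x))
}
t.ci(x, 0.50)   # a 50% CI for the mean
t.ci(x, 0.95)   # a 95% CI for the mean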

As an example, I have done just that. I have allowed the underlying distribution to vary across a wide range, from almost Bernoulli, to Uniform, to Normal, to Exponential, and all the way to Lognormal. These include symmetric distributions (the first three) and strongly skewed ones (the last two). For each distribution I generated 50,000 samples of size 12. For each sample I constructed two-sided CIs of confidence levels between $50\%$ and $99.8\%$, which covers most applications.

An interesting issue now arises: How should we measure how well (or how badly) a CI procedure is performing? A common method simply evaluates the difference between the actual coverage and the confidence level. This can look suspiciously good for high confidence levels, though. For instance, if you are trying to achieve 99.9% confidence but you get only 99% coverage, the raw difference is a mere 0.9%. However, that means your procedure fails to cover the target ten times more often than it should! For this reason, a more informative way of comparing coverages ought to use something like odds ratios. I use differences of logits, which are the logarithms of odds ratios. Specifically, when the desired confidence level is $\alpha$ and the actual coverage is $p$, then

$$\log\left(\frac{p}{1-p}\right) - \log\left(\frac{\alpha}{1-\alpha}\right)$$

nicely captures the difference. When it is zero, the coverage is exactly the value intended. When it is negative, the coverage is too low--which means the CI is too optimistic and underestimates the uncertainty.
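
To make the 99.9%-versus-99% example above concrete, here is a quick check in R (using only the numbers from that example, not any simulation output):

logit <- function(p) log(p / (1 - p))
0.99 - 0.999                      # raw difference: a mere -0.009
logit(0.99) - logit(0.999)        # about -2.3 on the logit scale
exp(logit(0.999) - logit(0.99))   # about 10: the failure odds are ten times what was intended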

The question, then, is how do these error rates vary with confidence level as the underlying model is perturbed? We can answer it by plotting the simulation results. These plots quantify how "unrealistic" the "near-certainty" of a CI might be in this archetypal application.

[Figure: coverage accuracy versus confidence level for each distribution, with the left panel on logit scales and the right panel on raw scales.]

The graphics show the same results, but the one at the left displays the values on logit scales while the one at the right uses raw scales. The Beta distribution is a Beta$(1/30,1/30)$ (which is practically a Bernoulli distribution). The lognormal distribution is the exponential of the standard Normal distribution. The normal distribution is included to verify that this CI procedure really does attain its intended coverage and to reveal how much variation to expect from the finite simulation size. (Indeed, the graphs for the normal distribution are comfortably close to zero, showing no significant deviations.)

It is clear that on the logit scale, the coverages grow more divergent as the confidence level increases. There are some interesting exceptions, though. If we are unconcerned with perturbations of the model that introduce skewness or long tails, then we can ignore the exponential and lognormal and focus on the rest. Their behavior is erratic until $\alpha$ exceeds $95\%$ or so (a logit of $3$), at which point the divergence has set in.

This little study brings some concreteness to Gelman's claim and illustrates some of the phenomena he might have had in mind. In particular, when we are using a CI procedure with a low confidence level, such as $\alpha=50\%$, then even when the underlying model is strongly perturbed, it looks like the coverage will still be close to $50\%$: our feeling that such a CI will be correct about half the time and incorrect the other half is borne out. That is robust. If instead we are hoping to be right, say, $95\%$ of the time, which means we really want to be wrong only $5\%$ of the time, then we should be prepared for our error rate to be much greater in case the world doesn't work quite the way our model supposes.

Incidentally, this property of $50\%$ CIs holds in large part because we are studying symmetric confidence intervals. For the skewed distributions, the individual confidence limits can be terrible (and not robust at all), but their errors often cancel out. Typically one tail is short and the other long, leading to over-coverage at one end and under-coverage at the other. I believe that $50\%$ confidence limits will not be anywhere near as robust as the corresponding intervals.
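
The following sketch checks this for the lognormal case at the 50% level by tallying which end of the interval misses the target; it is a hypothetical add-on, separate from the code below that produced the plots. If the limits were individually well calibrated, each one-sided miss rate would be near 25%.

#
# One-sided errors of the 50% t-interval under the (zero-mean) lognormal model.
# (Hypothetical add-on; not part of the code that produced the plots.)
#
set.seed(17)
n.sample <- 12; n.sim <- 5e4
q <- qt(0.75, n.sample - 1)                              # two-sided 50% multiplier
x <- matrix(exp(rnorm(n.sim * n.sample, -1/2)) - 1, n.sample)
m <- colMeans(x); se <- apply(x, 2, sd) / sqrt(n.sample)
lower <- m - q * se; upper <- m + q * se
mean(lower > 0)                  # target (0) falls below the interval
mean(upper < 0)                  # target falls above the interval
mean(lower <= 0 & upper >= 0)    # two-sided coverage; compare with 0.50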


This is the R code that produced the plots. It is readily modified to study other distributions, other ranges of confidence, and other CI procedures.

#
# Zero-mean distributions.
#
distributions <- list(Beta=function(n) rbeta(n, 1/30, 1/30) - 1/2,
                      Uniform=function(n) runif(n, -1, 1),
                      Normal=rnorm, 
                      #Mixture=function(n) rnorm(n, -2) + rnorm(n, 2),
                      Exponential=function(n) rexp(n) - 1,
                      Lognormal=function(n) exp(rnorm(n, -1/2)) - 1
)
n.sample <- 12
n.sim <- 5e4
alpha.logit <- seq(0, 6, length.out=21); alpha <- signif(1 / (1 + exp(-alpha.logit)), 3)
#
# Normal CI.
#
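# (Row 1 of Z gives the lower-endpoint multipliers, row 2 the upper-endpoint
# multipliers, for all values of alpha at once.)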
CI <- function(x, Z=outer(c(1,-1), qt((1-alpha)/2, n.sample-1))) 
  mean(x) + Z * sd(x) / sqrt(length(x))
#
# The simulation.
#
#set.seed(17)
alpha.s <- paste0("alpha=", alpha)
sim <- lapply(distributions, function(dist) {
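  # Each column of x holds one simulated sample of size n.sample.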
  x <- matrix(dist(n.sim*n.sample), n.sample)
  x.ci <- array(apply(x, 2, CI), c(2, length(alpha), n.sim),
                dimnames=list(Endpoint=c("Lower", "Upper"),
                              Alpha=alpha.s,
                              NULL))
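  # Every distribution has mean 0, so an interval covers its target exactly
  # when its two endpoints have opposite signs (their product is <= 0).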
  covers <- x.ci["Lower",,] * x.ci["Upper",,] <= 0
  rowMeans(covers)
})
(sim)
#
# The plots.
#
logit <- function(p) log(p/(1-p))
colors <- hsv((1:length(sim)-1)/length(sim), 0.8, 0.7)
par(mfrow=c(1,2))         
plot(range(alpha.logit), c(-2,1), type="n", 
     main="Confidence Interval Accuracies (Logit Scales)", cex.main=0.8,
     xlab="Logit(alpha)", 
     ylab="Logit(coverage) - Logit(alpha)")
abline(h=0, col="Gray", lwd=2)
legend("bottomleft", names(sim), col=colors, lwd=2, bty="n", cex=0.8)
for(i in 1:length(sim)) {
  coverage <- sim[[i]]
  lines(alpha.logit, logit(coverage) - alpha.logit, col=colors[i], lwd=2)
}

plot(range(alpha), c(-0.2, 0.05), type="n", 
     main="Raw Confidence Interval Accuracies", cex.main=0.8,
     xlab="alpha", 
     ylab="coverage-alpha")
abline(h=0, col="Gray", lwd=2)
legend("bottomleft", names(sim), col=colors, lwd=2, bty="n", cex=0.8)
for(i in 1:length(sim)) {
  coverage <- sim[[i]]
  lines(alpha, coverage - alpha, col=colors[i], lwd=2)
}