Why does a 95% Confidence Interval (CI) not imply a 95% chance of containing the mean?

confidence interval, mean, population, probability, sampling

It seems that, across various related questions here, there is consensus that the "95%" in a "95% confidence interval" refers to the fact that if we exactly replicated our sampling and CI-computation procedures many times, 95% of the CIs so computed would contain the population mean. It also seems to be the consensus that this definition does not permit one to conclude from a single 95% CI that there is a 95% chance that the mean falls within it. However, I don't understand how the former fails to imply the latter. Having imagined many CIs, 95% of which contain the population mean, shouldn't our uncertainty about whether our actually computed CI contains the population mean force us to use the base rate of the imagined cases (95%) as our estimate of the probability that our actual CI contains the population mean?

I've seen posts argue along the lines of "the actually computed CI either contains the population mean or it doesn't, so its probability is either 1 or 0", but this seems to imply a strange definition of probability that depends on unknown states (e.g. a friend flips a fair coin, hides the result, and I am disallowed from saying there is a 50% chance that it's heads).

Surely I'm wrong, but I don't see where my logic has gone awry…

Best Answer

Part of the issue is that the frequentist definition of a probability doesn't allow a nontrivial probability to be applied to the outcome of a particular experiment, but only to some fictitious population of experiments from which this particular experiment can be considered a sample. The definition of a CI is confusing as it is a statement about this (usually) fictitious population of experiments, rather than about the particular data collected in the instance at hand. So part of the issue is one of the definition of a probability: The idea of the true value lying within a particular interval with probability 95% is inconsistent with a frequentist framework.

Another aspect of the issue is that the calculation of a frequentist confidence interval doesn't use all of the information in the particular sample that is relevant to bounding the true value of the parameter. My question "Are there any examples where Bayesian credible intervals are obviously inferior to frequentist confidence intervals?" discusses a paper by Edwin Jaynes with some very good examples that highlight the difference between confidence intervals and credible intervals. One that is particularly relevant to this discussion is Example 5, which compares a credible interval and a confidence interval for estimating the parameter of a truncated exponential distribution (for a problem in industrial quality control). In the example he gives, there is enough information in the sample to be certain that the true value of the parameter lies nowhere in a properly constructed 90% confidence interval!

This may seem shocking to some, but the reason for this result is that confidence intervals and credible intervals are answers to two different questions, from two different interpretations of probability.

The confidence interval is the answer to the request: "Give me an interval that will bracket the true value of the parameter in $100p$% of the instances of an experiment that is repeated a large number of times." The credible interval is an answer to the request: "Give me an interval that brackets the true value with probability $p$ given the particular sample I've actually observed." To be able to answer the latter request, we must first adopt either (a) a new concept of the data generating process or (b) a different concept of the definition of probability itself.
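The first request can be checked by brute force. The following is a minimal sketch, assuming normally distributed data with a known standard deviation and the usual 95% z-interval for the mean; the function name, parameter values and seed are illustrative assumptions rather than anything from the question. It repeats the experiment many times and reports the fraction of intervals that bracket the true mean.

```python
import random
import statistics


def coverage_of_z_interval(true_mean=10.0, sigma=2.0, n=25, trials=20_000, seed=0):
    """Fraction of repeated experiments whose 95% z-interval contains true_mean.

    Assumes normal data with KNOWN sigma (an illustrative simplification).
    """
    rng = random.Random(seed)
    z = 1.959964  # 97.5th percentile of the standard normal distribution
    half_width = z * sigma / n ** 0.5
    hits = 0
    for _ in range(trials):
        sample = [rng.gauss(true_mean, sigma) for _ in range(n)]
        centre = statistics.fmean(sample)
        # The interval is random; the true mean is fixed.
        if centre - half_width <= true_mean <= centre + half_width:
            hits += 1
    return hits / trials


print(coverage_of_z_interval())  # should come out close to 0.95
```

The 95% here is a property of the procedure over the whole loop. Nothing in the code assigns a probability to any single interval produced inside it.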

The main reason that a particular 95% confidence interval does not imply a 95% chance of containing the mean is that the confidence interval answers a different question, so it is the right answer only when the two questions happen to have the same numerical solution.

In short, credible and confidence intervals answer different questions from different perspectives; both are useful, but you need to choose the right interval for the question you actually want to ask. If you want an interval that admits an interpretation of a 95% (posterior) probability of containing the true value, then choose a credible interval (and, with it, the attendant conceptualization of probability), not a confidence interval. The thing you ought not to do is to adopt a different definition of probability in the interpretation than that used in the analysis.

Thanks to @cardinal for his refinements!

Here is a concrete example, from David MacKay's excellent book "Information Theory, Inference and Learning Algorithms" (page 464):

Let the parameter of interest be $\theta$ and the data $D$, a pair of points $x_1$ and $x_2$ drawn independently from the following distribution:

$p(x|\theta) = \left\{\begin{array}{cl} 1/2 & x = \theta,\\1/2 & x = \theta + 1, \\ 0 & \mathrm{otherwise}\end{array}\right.$

If $\theta$ is $39$, then we would expect to see the datasets $(39,39)$, $(39,40)$, $(40,39)$ and $(40,40)$ all with equal probability $1/4$. Consider the confidence interval

$[\theta_\mathrm{min}(D),\theta_\mathrm{max}(D)] = [\mathrm{min}(x_1,x_2), \mathrm{max}(x_1,x_2)]$.

Clearly this is a valid 75% confidence interval because if you re-sampled the data, $D = (x_1,x_2)$, many times then the confidence interval constructed in this way would contain the true value 75% of the time.
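Because there are only four equally likely datasets, the 75% figure can be verified by exact enumeration rather than simulation. This is a small sketch; the helper names `interval` and `exact_coverage` are hypothetical and chosen only for illustration.

```python
from fractions import Fraction
from itertools import product


def interval(x1, x2):
    """The interval [min(x1, x2), max(x1, x2)] from the example."""
    return min(x1, x2), max(x1, x2)


def exact_coverage(theta):
    """Exact probability that the interval contains theta.

    Each observation is theta or theta + 1 with probability 1/2, so the
    four ordered pairs are equally likely.
    """
    outcomes = list(product([theta, theta + 1], repeat=2))
    hits = 0
    for x1, x2 in outcomes:
        lo, hi = interval(x1, x2)
        if lo <= theta <= hi:
            hits += 1
    return Fraction(hits, len(outcomes))


print(exact_coverage(39))  # 3/4
```

The only dataset that misses is $(\theta+1, \theta+1)$, and the result is the same for every value of $\theta$, which is what makes the procedure a valid 75% confidence interval.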

Now consider the data $D = (29,29)$. In this case the frequentist 75% confidence interval would be $[29, 29]$. However, assuming the model of the generating process is correct, $\theta$ could be 28 or 29 in this case, and we have no reason to suppose that 29 is more likely than 28, so the posterior probability is $p(\theta=28|D) = p(\theta=29|D) = 1/2$. So in this case the frequentist confidence interval is clearly not a 75% credible interval as there is only a 50% probability that it contains the true value of $\theta$, given what we can infer about $\theta$ from this particular sample.
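The posterior calculation can be sketched in the same way. This assumes a flat prior over a finite range of integer candidates for $\theta$, which is one way to encode "no reason to suppose that 29 is more likely than 28"; the candidate range and the function names are illustrative assumptions.

```python
from fractions import Fraction


def likelihood(data, theta):
    """p(data | theta): each point is theta or theta + 1 with probability 1/2."""
    p = Fraction(1)
    for x in data:
        p *= Fraction(1, 2) if x in (theta, theta + 1) else Fraction(0)
    return p


def posterior(data, candidates):
    """Posterior over theta under a flat prior on `candidates` (an assumption)."""
    weights = {t: likelihood(data, t) for t in candidates}
    total = sum(weights.values())
    return {t: w / total for t, w in weights.items() if w > 0}


def credibility_of_interval(data, candidates):
    """Posterior probability that [min(data), max(data)] contains theta."""
    lo, hi = min(data), max(data)
    return sum(p for t, p in posterior(data, candidates).items() if lo <= t <= hi)


candidates = range(0, 100)  # hypothetical prior support, wide enough for the data
print(credibility_of_interval((29, 29), candidates))  # 1/2
print(credibility_of_interval((29, 30), candidates))  # 1
```

When the two observations differ, $\theta$ is pinned down exactly and the interval certainly contains it; when they coincide, the interval is right only half the time. Each case occurs with probability 1/2, so the average over datasets is $\tfrac12 \cdot \tfrac12 + \tfrac12 \cdot 1 = \tfrac34$, which recovers the 75% long-run coverage even though no individual dataset has a 75% posterior probability.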

Yes, this is a contrived example, but if confidence intervals and credible intervals were not different, then they would still be identical in contrived examples.

Note that the key difference is that the confidence interval is a statement about what would happen if you repeated the experiment many times, whereas the credible interval is a statement about what can be inferred from this particular sample.
