Part of the issue is that the frequentist definition of a probability doesn't allow a nontrivial probability to be applied to the outcome of a particular experiment, but only to some fictitious population of experiments from which this particular experiment can be considered a sample. The definition of a CI is confusing as it is a statement about this (usually) fictitious population of experiments, rather than about the particular data collected in the instance at hand. So part of the issue is one of the definition of a probability: The idea of the true value lying within a particular interval with probability 95% is inconsistent with a frequentist framework.
Another aspect of the issue is that the calculation of the frequentist confidence interval doesn't use all of the information in the particular sample that is relevant to bounding the true value of the statistic. My question "Are there any examples where Bayesian credible intervals are obviously inferior to frequentist confidence intervals" discusses a paper by Edwin Jaynes with some excellent examples that highlight the difference between confidence intervals and credible intervals. One that is particularly relevant to this discussion is Example 5, which contrasts a credible and a confidence interval for estimating the parameter of a truncated exponential distribution (in a problem of industrial quality control). In the example he gives, there is enough information in the sample to be certain that the true value of the parameter lies nowhere in a properly constructed 90% confidence interval!
This may seem shocking to some, but the reason for this result is that confidence intervals and credible intervals are answers to two different questions, from two different interpretations of probability.
The confidence interval is the answer to the request: "Give me an interval that will bracket the true value of the parameter in $100p$% of the instances of an experiment that is repeated a large number of times." The credible interval is an answer to the request: "Give me an interval that brackets the true value with probability $p$ given the particular sample I've actually observed." To be able to answer the latter request, we must first adopt either (a) a new concept of the data generating process or (b) a different concept of the definition of probability itself.
The main reason that a particular 95% confidence interval does not imply a 95% chance of containing the mean is that the confidence interval is an answer to a different question; it is the right answer to the other question only when the answers to the two questions happen to coincide numerically.
In short, credible and confidence intervals answer different questions from different perspectives; both are useful, but you need to choose the right interval for the question you actually want to ask. If you want an interval that admits an interpretation of a 95% (posterior) probability of containing the true value, then choose a credible interval (and, with it, the attendant conceptualization of probability), not a confidence interval. The thing you ought not to do is to adopt a different definition of probability in the interpretation than that used in the analysis.
Thanks to @cardinal for his refinements!
Here is a concrete example, from David MacKay's excellent book "Information Theory, Inference, and Learning Algorithms" (page 464):
Let the parameter of interest be $\theta$ and the data $D$, a pair of points $x_1$ and $x_2$ drawn independently from the following distribution:
$p(x|\theta) = \left\{\begin{array}{cl} 1/2 & x = \theta,\\1/2 & x = \theta + 1, \\ 0 & \mathrm{otherwise}\end{array}\right.$
If $\theta$ is $39$, then we would expect to see the datasets $(39,39)$, $(39,40)$, $(40,39)$ and $(40,40)$ all with equal probability $1/4$. Consider the confidence interval
$[\theta_\mathrm{min}(D),\theta_\mathrm{max}(D)] = [\mathrm{min}(x_1,x_2), \mathrm{max}(x_1,x_2)]$.
Clearly this is a valid 75% confidence interval: the interval fails to contain the true value only for the dataset $(\theta+1, \theta+1)$, which occurs with probability $1/4$. So if you re-sampled the data, $D = (x_1,x_2)$, many times, the confidence interval constructed in this way would contain the true value 75% of the time.
Now consider the data $D = (29,29)$. In this case the frequentist 75% confidence interval would be $[29, 29]$. However, assuming the model of the generating process is correct, $\theta$ could be 28 or 29 in this case, and we have no reason to suppose that 29 is more likely than 28, so the posterior probability is $p(\theta=28|D) = p(\theta=29|D) = 1/2$. So in this case the frequentist confidence interval is clearly not a 75% credible interval as there is only a 50% probability that it contains the true value of $\theta$, given what we can infer about $\theta$ from this particular sample.
Yes, this is a contrived example, but if confidence intervals and credible intervals were not different, then they would still be identical in contrived examples.
Note the key difference is that the confidence interval is a statement about what would happen if you repeated the experiment many times, the credible interval is a statement about what can be inferred from this particular sample.
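Both claims in this example are easy to check by simulation. Below is a minimal sketch in Python; the simulation sizes and the flat-prior range in the second part are my own assumptions for illustration:

```python
import random

random.seed(0)

# Long-run coverage of the interval [min(x1,x2), max(x1,x2)]
# for the two-point example (theta = 39, as in the text):
theta = 39
trials = 100_000
cover = 0
for _ in range(trials):
    x1 = theta + random.randint(0, 1)
    x2 = theta + random.randint(0, 1)
    cover += (min(x1, x2) <= theta <= max(x1, x2))
print(cover / trials)  # close to 0.75: a valid 75% confidence interval

# Conditional coverage given the observed sample D = (29, 29),
# under a flat prior on theta (range chosen arbitrarily for illustration):
hits = total = 0
for _ in range(500_000):
    t = random.randint(20, 40)
    x1 = t + random.randint(0, 1)
    x2 = t + random.randint(0, 1)
    if (x1, x2) == (29, 29):
        total += 1
        hits += (t == 29)  # the interval [29, 29] contains t only if t == 29
print(hits / total)  # close to 0.5, not 0.75
```

The first loop repeats the experiment and recovers the 75% long-run coverage; the second conditions on the particular sample and recovers the 50% posterior probability.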
Update: With the benefit of a few years' hindsight, I've penned a more concise treatment of essentially the same material in response to a similar question.
How to Construct a Confidence Region
Let us begin with a general method for constructing confidence regions. It can be applied to a single parameter, to yield a confidence interval or set of intervals; and it can be applied to two or more parameters, to yield higher dimensional confidence regions.
We assert that the observed statistics $D$ originate from a distribution with parameters $\theta$, namely the sampling distribution $s(d|\theta)$ over possible statistics $d$, and seek a confidence region for $\theta$ in the set of possible values $\Theta$. Define a Highest Density Region (HDR): the $h$-HDR of a PDF is the smallest subset of its domain that supports probability $h$. Denote the $h$-HDR of $s(d|\psi)$ as $H_\psi$, for any $\psi \in \Theta$. Then, the $h$ confidence region for $\theta$, given data $D$, is the set $C_D = \{ \phi : D \in H_\phi \}$. A typical value of $h$ would be 0.95.
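As a concrete numerical sketch of this construction, here is a discretized version on a grid, using a Binomial sampling distribution as a stand-in for $s(d|\psi)$; the function names, grid, and example values are my own assumptions:

```python
import numpy as np
from math import comb

n, h = 10, 0.95                      # assumed example: s(d|psi) = Binomial(n, psi)
d_grid = np.arange(n + 1)

def sampling_pmf(psi):
    # pmf of the assumed sampling distribution over possible statistics d
    return np.array([comb(n, d) * psi**d * (1 - psi)**(n - d) for d in d_grid])

def hdr(pmf, h):
    # h-HDR: smallest set of d-values whose total probability reaches h,
    # built by taking values in decreasing order of probability
    order = np.argsort(pmf)[::-1]
    k = np.searchsorted(np.cumsum(pmf[order]), h) + 1
    return set(d_grid[order[:k]].tolist())

def confidence_region(D, psi_grid, h):
    # C_D = {psi : D in H_psi}
    return [psi for psi in psi_grid if D in hdr(sampling_pmf(psi), h)]

psi_grid = np.linspace(0.01, 0.99, 99)
C = confidence_region(D=3, psi_grid=psi_grid, h=h)
print(min(C), max(C))  # endpoints of the 0.95 confidence region given D = 3
```

The same two functions apply unchanged to any gridded sampling distribution; only `sampling_pmf` encodes the model.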
A Frequentist Interpretation
From the preceding definition of a confidence region follows
$$
d \in H_\psi \longleftrightarrow \psi \in C_d
$$
with $C_d = \{ \phi : d \in H_\phi \}$. Now imagine a large set of (imaginary) observations $\{D_i\}$, taken under similar circumstances to $D$; i.e., they are samples from $s(d|\theta)$. Since $H_\theta$ supports probability mass $h$ of the PDF $s(d|\theta)$, $P(D_i \in H_\theta) = h$ for all $i$. Therefore, the long-run fraction of $\{D_i\}$ for which $D_i \in H_\theta$ is $h$. And so, using the equivalence above, the long-run fraction of $\{D_i\}$ for which $\theta \in C_{D_i}$ is also $h$.
This, then, is what the frequentist claim for the $h$ confidence region for $\theta$ amounts to:
Take a large number of imaginary observations $\{D_i\}$ from the sampling distribution $s(d|\theta)$ that gave rise to the observed statistics $D$. Then, $\theta$ lies within a fraction $h$ of the analogous but imaginary confidence regions $\{C_{D_i}\}$.
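This long-run claim can be checked by simulation. A minimal sketch for the familiar normal-mean case, where $C_d$ is the usual interval $d \pm z\,\sigma/\sqrt{n}$ (all example values below are assumed):

```python
import random
import statistics

random.seed(1)
theta, sigma, n = 5.0, 2.0, 25       # assumed "true" parameter and known sigma
z = 1.96                             # for h = 0.95
se = sigma / n ** 0.5

trials = 20_000
covered = 0
for _ in range(trials):
    # one imaginary observation D_i: the sample mean of n draws from s(d|theta)
    d = statistics.fmean(random.gauss(theta, sigma) for _ in range(n))
    # C_{D_i} for the normal mean is the interval d +/- z*se
    covered += (d - z * se <= theta <= d + z * se)
print(covered / trials)  # long-run fraction close to h = 0.95
```

Note what the simulation does and does not show: it fixes $\theta$ and varies the data, which is exactly the frequentist claim, and it says nothing about any one realized interval.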
The confidence region $C_D$ therefore does not make any claim about the probability that $\theta$ lies somewhere! The reason is simply that there is nothing in the formulation that allows us to speak of a probability distribution over $\theta$. The interpretation is just elaborate superstructure, which does not improve the base. The base is only $s(d | \theta)$ and $D$, where $\theta$ does not appear as a distributed quantity, and there is no information we can use to address that. There are basically two ways to get a distribution over $\theta$:
- Assign a distribution directly from the information at hand: $p(\theta | I)$.
- Relate $\theta$ to another distributed quantity: $p(\theta | I) = \int p(\theta x | I) dx = \int p(\theta | x I) p(x | I) dx$.
In both cases, $\theta$ must appear on the left somewhere. Frequentists cannot use either method, because they both require a heretical prior.
A Bayesian View
The most a Bayesian can make of the $h$ confidence region $C_D$, given without qualification, is simply the direct interpretation: that it is the set of $\phi$ for which $D$ falls in the $h$-HDR $H_\phi$ of the sampling distribution $s(d|\phi)$. It does not necessarily tell us much about $\theta$, and here's why.
The probability that $\theta \in C_D$, given $D$ and the background information $I$, is:
\begin{align*}
P(\theta \in C_D | DI) &= \int_{C_D} p(\theta | DI) d\theta \\
&= \int_{C_D} \frac{p(D | \theta I) p(\theta | I)}{p(D | I)} d\theta
\end{align*}
Notice that, unlike the frequentist interpretation, we have immediately demanded a distribution over $\theta$. The background information $I$ tells us, as before, that the sampling distribution is $s(d | \theta)$:
\begin{align*}
P(\theta \in C_D | DI) &= \int_{C_D} \frac{s(D | \theta) p(\theta | I)}{p(D | I)} d \theta \\
&= \frac{\int_{C_D} s(D | \theta) p(\theta | I) d\theta}{p(D | I)} \\
\text{i.e.} \quad\quad P(\theta \in C_D | DI) &= \frac{\int_{C_D} s(D | \theta) p(\theta | I) d\theta}{\int s(D | \theta) p(\theta | I) d\theta}
\end{align*}
Now this expression does not in general evaluate to $h$, which is to say, the $h$ confidence region $C_D$ does not always contain $\theta$ with probability $h$. In fact it can be starkly different from $h$. There are, however, many common situations in which it does evaluate to $h$, which is why confidence regions are often consistent with our probabilistic intuitions.
For example, suppose that the prior joint PDF of $d$ and $\theta$ is symmetric in that $p_{d,\theta}(d,\theta | I) = p_{d,\theta}(\theta,d | I)$. (Clearly this involves an assumption that the PDF ranges over the same domain in $d$ and $\theta$.) Then, if the prior is $p(\theta | I) = f(\theta)$, we have $s(D | \theta) p(\theta | I) = s(D | \theta) f(\theta) = s(\theta | D) f(D)$. Hence
\begin{align*}
P(\theta \in C_D | DI) &= \frac{\int_{C_D} s(\theta | D) d\theta}{\int s(\theta | D) d\theta} \\
\text{i.e.} \quad\quad P(\theta \in C_D | DI) &= \int_{C_D} s(\theta | D) d\theta
\end{align*}
From the definition of an HDR we know that for any $\psi \in \Theta$
\begin{align*}
\int_{H_\psi} s(d | \psi) dd &= h \\
\text{and therefore that} \quad\quad \int_{H_D} s(d | D) dd &= h \\
\text{or equivalently} \quad\quad \int_{H_D} s(\theta | D) d\theta &= h
\end{align*}
Therefore, given that $s(d | \theta) f(\theta) = s(\theta | d) f(d)$, $C_D = H_D$ implies $P(\theta \in C_D | DI) = h$. The antecedent satisfies
$$
C_D = H_D \longleftrightarrow \forall \psi \; [ \psi \in C_D \leftrightarrow \psi \in H_D ]
$$
Applying the equivalence near the top:
$$
C_D = H_D \longleftrightarrow \forall \psi \; [ D \in H_\psi \leftrightarrow \psi \in H_D ]
$$
Thus, the confidence region $C_D$ contains $\theta$ with probability $h$ if for all possible values $\psi$ of $\theta$, the $h$-HDR of $s(d | \psi)$ contains $D$ if and only if the $h$-HDR of $s(d | D)$ contains $\psi$.
Now the symmetric relation $D \in H_\psi \leftrightarrow \psi \in H_D$ is satisfied for all $\psi$ when $s(\psi + \delta | \psi) = s(D - \delta | D)$ for all $\delta$ that span the support of $s(d | D)$ and $s(d | \psi)$. We can therefore form the following argument:
- $s(d | \theta) f(\theta) = s(\theta | d) f(d)$ (premise)
- $\forall \psi \; \forall \delta \; [ s(\psi + \delta | \psi) = s(D - \delta | D) ]$ (premise)
- $\forall \psi \; \forall \delta \; [ s(\psi + \delta | \psi) = s(D - \delta | D) ] \longrightarrow \forall \psi \; [ D \in H_\psi \leftrightarrow \psi \in H_D ]$
- $\therefore \quad \forall \psi \; [ D \in H_\psi \leftrightarrow \psi \in H_D ]$
- $\forall \psi \; [ D \in H_\psi \leftrightarrow \psi \in H_D ] \longrightarrow C_D = H_D$
- $\therefore \quad C_D = H_D$
- $[s(d | \theta) f(\theta) = s(\theta | d) f(d) \wedge C_D = H_D] \longrightarrow P(\theta \in C_D | DI) = h$
- $\therefore \quad P(\theta \in C_D | DI) = h$
Let's apply the argument to a confidence interval on the mean of a 1-D normal distribution $(\mu, \sigma)$, given a sample mean $\bar{x}$ from $n$ measurements. We have $\theta = \mu$ and $d = \bar{x}$, so that the sampling distribution is
$$
s(d | \theta) = \frac{\sqrt{n}}{\sigma \sqrt{2 \pi}} e^{-\frac{n}{2 \sigma^2} { \left( d - \theta \right) }^2 }
$$
Suppose also that we know nothing about $\theta$ before taking the data (except that it's a location parameter) and therefore assign a uniform prior: $f(\theta) = k$. Clearly we now have $s(d | \theta) f(\theta) = s(\theta | d) f(d)$, so the first premise is satisfied. Let $s(d | \theta) = g\left( (d - \theta)^2 \right)$. (i.e. It can be written in that form.) Then
\begin{gather*}
s(\psi + \delta | \psi) = g \left( (\psi + \delta - \psi)^2 \right) = g(\delta^2) \\
\text{and} \quad\quad s(D - \delta | D) = g \left( (D - \delta - D)^2 \right) = g(\delta^2) \\
\text{so that} \quad\quad \forall \psi \; \forall \delta \; [s(\psi + \delta | \psi) = s(D - \delta | D)]
\end{gather*}
whereupon the second premise is satisfied. Both premises being true, the eight-point argument leads us to conclude that the probability that $\theta$ lies in the confidence interval $C_D$ is $h$!
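The conclusion can also be verified directly: under the flat prior the posterior for $\theta$ is $N(D, \sigma^2/n)$, and the posterior mass inside the confidence interval is exactly $h$. A short sketch (the example numbers are my own):

```python
from math import erf, sqrt

def std_normal_cdf(x):
    # standard normal CDF via the error function
    return 0.5 * (1 + erf(x / sqrt(2)))

D, sigma, n = 10.3, 2.0, 16          # assumed: observed sample mean, known sigma
z = 1.959964                         # for h = 0.95
se = sigma / sqrt(n)
lo, hi = D - z * se, D + z * se      # the h confidence interval C_D

# Under the uniform prior, the posterior for theta is N(D, se^2),
# so P(theta in C_D | D) is the posterior mass between lo and hi:
post_prob = std_normal_cdf((hi - D) / se) - std_normal_cdf((lo - D) / se)
print(post_prob)  # 0.95, to the precision of z
```

The value of $D$ drops out entirely, which is the point: for this symmetric, flat-prior case the posterior probability equals $h$ for every sample.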
We therefore have an amusing irony:
- The frequentist who assigns the $h$ confidence interval cannot say that $P(\theta \in C_D) = h$, no matter how innocently uniform $\theta$ looks before incorporating the data.
- The Bayesian who would not assign an $h$ confidence interval in that way knows anyhow that $P(\theta \in C_D | DI) = h$.
Final Remarks
We have identified conditions (i.e. the two premises) under which the $h$ confidence region does indeed yield probability $h$ that $\theta \in C_D$. A frequentist will baulk at the first premise, because it involves a prior on $\theta$, and this sort of deal-breaker is inescapable on the route to a probability. But for a Bayesian, it is acceptable---nay, essential. These conditions are sufficient but not necessary, so there are many other circumstances under which the Bayesian $P(\theta \in C_D | DI)$ equals $h$. Equally though, there are many circumstances in which $P(\theta \in C_D | DI) \ne h$, especially when the prior information is significant.
We have applied a Bayesian analysis just as a consistent Bayesian would, given the information at hand, including statistics $D$. But a Bayesian, if he possibly can, will apply his methods to the raw measurements instead---to the $\{x_i\}$, rather than $\bar{x}$. Oftentimes, collapsing the raw data into summary statistics $D$ destroys information in the data; and then the summary statistics are incapable of speaking as eloquently as the original data about the parameters $\theta$.
Best Answer
This is a partial misunderstanding of the real consensus. The confusion comes from not being specific about which probability we are talking about: not as a philosophical question, but as a matter of which exact probability is meant in the given context. As @ratsalad says, it is all about conditioning.
Call $\theta$ your parameter, $X$ your data, and $I$ an interval that is a function of $X$:

- The frequentist statement concerns $P(\theta \in I)$, where the probability is taken over the sampling distribution of $X$ (and hence of the random interval $I$), with $\theta$ fixed.
- The Bayesian statement concerns $P(\theta \in I \mid X)$, the probability conditioned on the particular data observed.

Both are probabilities of the same event, $\theta \in I$, but conditioned differently.
The reason why one discourages saying "the probability that $\theta$ is in $I$ is 0.95" for confidence intervals is because this sentence implicitly means the second point: when we say "the probability that..." the conditioning is implicitly to what has been observed before: "I have seen some $X$, now what is the probability that $\theta$ is..." is formally "what is $P(\theta...\mid X)$".
This implicit reading is reinforced by the (again implicit) suggestion you experience when reading "probability that $\theta$ is in $I$": that $\theta$ is the variable and $I$ the fixed object, while in the frequentist analysis it is the opposite.
Finally, this is made even worse when you replace $I$ by your calculated interval. If you write "the probability that $\theta$ is in $[4;5]$ is 0.95", then this is simply false. In frequentist analysis, "$\theta$ is in $[4;5]$" is either true or false; it is not a random event, so it does not have a probability (other than 0 or 1). Thus the sentence could only be meaningfully interpreted as the Bayesian one.
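The distinction shows up numerically as soon as the prior is informative. Here is a sketch under an assumed setup of my own: $\theta \sim N(0,1)$, $X \mid \theta \sim N(\theta, 1)$, and the usual interval $I = [X - 1.96,\, X + 1.96]$. Unconditionally, $P(\theta \in I) = 0.95$ exactly, because $X - \theta \sim N(0,1)$ whatever the prior; but conditionally on an observed $X$ far from 0 the probability is quite different:

```python
from math import erf, sqrt

def std_normal_cdf(t):
    # standard normal CDF via the error function
    return 0.5 * (1 + erf(t / sqrt(2)))

# Assumed setup: theta ~ N(0, 1), X | theta ~ N(theta, 1),
# I = [X - 1.96, X + 1.96].  Given X = x, the posterior is N(x/2, 1/2),
# so P(theta in I | X = x) is the posterior mass inside [x - 1.96, x + 1.96]:
x = 3.0
post_mean, post_sd = x / 2, sqrt(0.5)
p_cond = (std_normal_cdf((x + 1.96 - post_mean) / post_sd)
          - std_normal_cdf((x - 1.96 - post_mean) / post_sd))
print(p_cond)  # about 0.74 here, not 0.95
```

Same event, same interval; only the conditioning differs, and with it the numerical answer.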