I think the fundamental problem is that frequentist statistics can only assign a probability to something that can have a long run frequency. Whether the true value of a parameter lies in a particular interval doesn't have a long run frequency, because we can only perform the experiment once, so you can't assign a frequentist probability to it. The problem arises from the definition of a probability. If you change the definition to a Bayesian one, the problem instantly disappears, as you are no longer tied to discussion of long run frequencies.
See my (rather tongue-in-cheek) answer to a related question here:
"A Frequentist is someone that believes probabilies represent long run frequencies with which events ocurr; if needs be, he will invent a fictitious population from which your particular situation could be considered a random sample so that he can meaningfully talk about long run frequencies. If you ask him a question about a particular situation, he will not give a direct answer, but instead make a statement about this (possibly imaginary) population."
In the case of a confidence interval, the question we would normally like to ask (unless we have a problem in quality control, for example) is "given this sample of data, return the smallest interval that contains the true value of the parameter with probability X". However, a frequentist can't do this, as the experiment is only performed once, so there are no long run frequencies that can be used to assign a probability. Instead, the frequentist has to invent a population of experiments (that you didn't perform) from which the experiment you did perform can be considered a random sample. The frequentist then gives you an indirect answer about that fictitious population of experiments, rather than a direct answer to the question you really wanted to ask about your particular experiment.
Essentially it is a problem of language: the frequentist definition of a probability simply doesn't allow discussion of the probability of the true value of a parameter lying in a particular interval. That doesn't mean frequentist statistics are bad, or not useful, but it is important to know their limitations.
Regarding the major update
I am not sure we can say, within a frequentist framework, that "Before we calculate a 95% confidence interval, there is a 95% probability that the interval we calculate will cover the true parameter." There is an implicit inference here: that the long run frequency with which the true value of the parameter lies in confidence intervals constructed by some particular method is also the probability that the true value of the parameter will lie in the confidence interval for the particular sample of data we are going to use. This is a perfectly reasonable inference, but it is a Bayesian inference, not a frequentist one, as the probability that the true value of the parameter lies in the interval we construct for a particular sample of data has no long run frequency; we only have one sample of data. This is exactly the danger of frequentist statistics: common sense reasoning about probability is generally Bayesian, in that it is about the degree of plausibility of a proposition.
We can, however, "make some sort of non-frequentist argument that we're 95% sure the true parameter will lie in [a,b]"; that is exactly what a Bayesian credible interval is, and for many problems the Bayesian credible interval exactly coincides with the frequentist confidence interval.
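As a standard concrete example of this coincidence (well known, not from the question itself): for $X_i \sim N(\mu, 1)$ with a flat prior $p(\mu) \propto 1$, the posterior is $\mu \mid x \sim N(\bar{x}, 1/n)$, so the 95% credible interval $$\left[\bar{x} - 1.96/\sqrt{n},\ \bar{x} + 1.96/\sqrt{n}\right]$$ is numerically identical to the frequentist 95% confidence interval; only the interpretation attached to the 95% differs.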
"I don't want to make this a debate about the philosophy of probability", sadly this is unavoidable, the reason you can't assign a frequentist probability to whether the true value of the statistic lies in the confidence interval is a direct consequence of the frequentist philosophy of probability. Frequentists can only assign probabilities to things that can have long run frequencies, as that is how frequentists define probability in their philosophy. That doesn't make frequentist philosophy wrong, but it is important to understand the bounds imposed by the definition of a probability.
"Before I've entered the password and seen the interval (but after the computer has already calculated it), what's the probability that the interval will contain the true parameter? It's 95%, and this part is not up for debate:" This is incorrect, or at least in making such a statement, you have departed from the framework of frequentist statistics and have made a Bayesian inference involving a degree of plausibility in the truth of a statement, rather than a long run frequency. However, as I have said earlier, it is a perfectly reasonable and natural inference.
Nothing has changed before or after entering the password, because neither event can be assigned a frequentist probability. Frequentist statistics can be rather counter-intuitive, as we often want to ask questions about degrees of plausibility of statements regarding particular events, but this lies outside the remit of frequentist statistics; this is the origin of most misinterpretations of frequentist procedures.
The confusion comes from this sentence:
And yet, the consensus seems to be that a 95% confidence interval can NOT be interpreted as there being a 95% probability that the interval contains the true mean.
It is a partial misunderstanding of the real consensus. The confusion comes from not being specific about which probability we are talking about: not as a philosophical question, but as "what exact probability are we speaking of in this context?" As @ratsalad says, it's all about conditioning.
Call $\theta$ your parameter, $X$ your data, $I$ an interval that is a function of $X$:
- $I$ is a confidence interval means $P(\theta\in I\mid\theta)>0.95$ for all possible $\theta$, including the true one. The probability averages over all possible $X$ at fixed $\theta$. This is what you explain in your interpretation.
- $I$ being a (Bayesian) credible interval says $P(\theta\in I\mid X)>0.95$. The probability averages over all possible $\theta$ at fixed $X$.
Both are probabilities of the same event, but conditioned differently.
The reason one discourages saying "the probability that $\theta$ is in $I$ is 0.95" for confidence intervals is that this sentence implicitly means the second point: when we say "the probability that...", the conditioning is implicitly on what has been observed before. "I have seen some $X$, now what is the probability that $\theta$ is..." is formally "what is $P(\theta\ldots\mid X)$?"
This implicit conditioning is reinforced by the (again implicit) suggestion you experience when reading "probability that $\theta$ is in $I$": that $\theta$ is the variable and $I$ the fixed object, while in frequentist analysis it is the opposite.
Finally, this is made even worse when you replace $I$ by your calculated interval. If you write "The probability that $\theta$ is in $[4;5]$ is 0.95", then this is simply false. In frequentist analysis, "$\theta$ is in $[4;5]$" is either true or false; it is not a random event, so it does not have a probability (other than 0 or 1). Thus the sentence could only be meaningfully interpreted as the Bayesian one.
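A small simulation can make the conditioning distinction concrete. This is only an illustrative sketch: the prior, sample size, and slice points below are arbitrary choices of mine, and the tight prior is deliberately chosen so that the two conditionals visibly disagree.

```python
import numpy as np

# Joint simulation: theta ~ N(0, 1) (an arbitrary, deliberately tight
# prior), then xbar | theta ~ N(theta, 1/n) as in the known-variance model.
rng = np.random.default_rng(42)
n, sims = 10, 1_000_000
theta = rng.normal(0.0, 1.0, size=sims)
xbar = rng.normal(theta, 1.0 / np.sqrt(n))

half = 1.96 / np.sqrt(n)                 # half-width of the 95% CI
covered = np.abs(xbar - theta) <= half   # did [xbar -/+ half] catch theta?

# P(theta in I | theta): about 0.95 for any slice of theta values.
near_theta = np.abs(theta - 1.0) < 0.1
print(covered[near_theta].mean())        # ~0.95

# P(theta in I | X): depends on the prior; far from the prior mean the
# interval centred at xbar tends to miss the (shrunken) theta.
near_xbar = np.abs(xbar - 2.5) < 0.1
print(covered[near_xbar].mean())         # ~0.90 with these choices
```

The first conditional is the frequentist guarantee; the second is the Bayesian question, and it only matches 0.95 when the prior is diffuse enough, which is exactly the coincidence mentioned in the first answer.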
Best Answer
The key concepts with confidence intervals are coverage, correctness and accuracy.
Coverage
The coverage or confidence level should be explained first. It is the percentage of times the random interval is expected to include the true value of the parameter.
The way this is best shown is to take a probability statement about a pivotal statistic and show how the statement is inverted to get a confidence interval. An example might be getting a confidence interval for the mean $\mu$ of a normal distribution when the variance is known to be $1$. Let the sample be denoted $X_i$, $i=1,2,\ldots,n$. The students will know from undergraduate courses, or have been taught earlier in this particular graduate course, that the sample mean $\bar{X}=\sum X_i/n$ is normal with mean $\mu$ and variance $1/n$. Then the pivotal quantity is $$Z= \sqrt{n} (\bar{X} - \mu)$$ and $Z$ has a $N(0,1)$ distribution. So of course $\Pr(|Z| \le 1.96) = 0.95$ (from a table of the standard normal distribution). You do the inversion to show that this probability is the same as $\Pr(\bar{X}-1.96/\sqrt{n} \le \mu \le \bar{X}+1.96/\sqrt{n})$. Then the random interval $[\bar{X}-1.96/\sqrt{n},\ \bar{X}+1.96/\sqrt{n}]$ is a prescription for a 95% confidence interval for $\mu$.
It should be clear that if you repeated the experiment many times, where each time you randomly select $n$ observations from an $N(\mu, 1)$ distribution, then in close to 95% of the cases the interval will contain $\mu$, and of course this also means that in the remaining roughly 5% of cases $\mu$ will lie outside the interval. This is how I would explain an exact 95% confidence interval.
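This repeated-experiment picture is easy to verify by simulation; here is a minimal sketch (the values of $\mu$, $n$, and the number of replications are arbitrary illustration choices):

```python
import numpy as np

# Coverage check for [xbar - 1.96/sqrt(n), xbar + 1.96/sqrt(n)]
# with X_i ~ N(mu, 1); mu, n, and reps are arbitrary choices.
rng = np.random.default_rng(1)
mu, n, reps = 3.0, 25, 100_000

xbar = rng.normal(mu, 1.0, size=(reps, n)).mean(axis=1)
half = 1.96 / np.sqrt(n)
hits = (xbar - half <= mu) & (mu <= xbar + half)
print(hits.mean())  # close to 0.95
```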
You could certainly use some other simple example such as estimating the rate parameter for an exponential distribution. The idea is to construct a pivotal quantity whose distribution is known and is independent of any unknown parameters.
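For instance, one standard construction: if the $X_i$ are i.i.d. exponential with rate $\lambda$, then $2\lambda\sum_i X_i \sim \chi^2_{2n}$ is pivotal, and inverting $\Pr\left(\chi^2_{2n,\,0.025} \le 2\lambda\sum_i X_i \le \chi^2_{2n,\,0.975}\right) = 0.95$ gives the exact 95% interval $$\left[\frac{\chi^2_{2n,\,0.025}}{2\sum_i X_i},\ \frac{\chi^2_{2n,\,0.975}}{2\sum_i X_i}\right]$$ for $\lambda$, where $\chi^2_{2n,\,p}$ denotes the $p$-quantile of the $\chi^2_{2n}$ distribution.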
To explain the relationship between coverage and confidence, you can just point out that if you substituted $1.645$ for $1.96$ in the original probability statement for $Z$ you would get a probability of $0.90$ and hence by replacing $1.96$ by $1.645$ in the confidence interval prescription you would get a 90% confidence interval. This also illustrates how lowering the coverage tightens the width of the interval.
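The critical values themselves come straight from the standard normal quantile function; for example, with scipy (assuming it is available):

```python
from scipy.stats import norm

# Two-sided critical values: the 1.96 and 1.645 used above.
print(norm.ppf(0.975))  # ~1.960 -> 95% confidence interval
print(norm.ppf(0.950))  # ~1.645 -> 90% (narrower) interval
```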
Accuracy
The other two important concepts are what Efron calls accuracy and correctness; I like using that terminology. We have looked at examples of exact confidence intervals. They are accurate in the sense that the nominal coverage of 95% is the exact coverage probability. But sometimes it is convenient to use asymptotic theory. Instead of using the exact distribution of the pivotal quantity, we use the distribution it converges to as the sample size $n$ goes to infinity. Using this asymptotic distribution, the coverage of an advertised 95% confidence interval will not be exact for a given value of $n$. But if the approximation is good, we can say that the approximate confidence interval is reasonably accurate.
(This is important in the bootstrap literature for confidence intervals because bootstrap confidence intervals are never exact and in some situations certain bootstrap variants (e.g. the BCa method) give more accurate intervals than others. The bootstrap theory on order of accuracy was developed by Peter Hall and others who defined accuracy by the rate the interval approaches the advertised confidence level as $n$ goes to infinity. The results involve the use of Edgeworth expansions and can be found detailed in Hall's book The Bootstrap and Edgeworth Expansion.)
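As a concrete illustration of the bootstrap variants, here is a minimal sketch assuming scipy >= 1.7 (which provides scipy.stats.bootstrap); the data and statistic are arbitrary choices of mine:

```python
import numpy as np
from scipy.stats import bootstrap

# Skewed data make the differences between variants easier to see;
# everything here is an arbitrary illustration choice.
rng = np.random.default_rng(2)
sample = rng.exponential(scale=2.0, size=50)

for method in ("percentile", "basic", "BCa"):
    res = bootstrap((sample,), np.mean, confidence_level=0.95,
                    method=method, random_state=rng)
    print(method, res.confidence_interval)
```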
Correctness
Last of all, I would discuss correctness. For many problems there are several ways to construct confidence intervals with exact or asymptotic coverage 95%. How do we choose between them? Well, they will have different expected lengths. An exact confidence interval that has the shortest expected length is called correct, and it is the optimal one to choose. When a correct interval exists and an efficient estimator of the parameter is available, the correct confidence interval can be constructed from that efficient estimate.
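To see the expected-length comparison in action, here is a toy comparison of my own (not from Efron): for $N(\mu,1)$ data, intervals centred at the sample mean and at the sample median are both asymptotically valid at 95%, but the mean is the efficient estimator, so its interval is shorter by a factor of about $\sqrt{\pi/2}\approx 1.25$.

```python
import numpy as np

# Two asymptotically valid 95% intervals for the mean of N(mu, 1):
# one centred at the sample mean, one at the sample median (whose
# asymptotic sd is sqrt(pi/2)/sqrt(n) for normal data).
rng = np.random.default_rng(3)
mu, n, reps = 0.0, 100, 50_000
data = rng.normal(mu, 1.0, size=(reps, n))

half_mean = 1.96 / np.sqrt(n)
half_med = 1.96 * np.sqrt(np.pi / 2) / np.sqrt(n)
print(half_med / half_mean)  # ~1.25: the median interval is wider

# Both achieve roughly 95% coverage:
print(np.mean(np.abs(data.mean(axis=1) - mu) <= half_mean))       # ~0.95
print(np.mean(np.abs(np.median(data, axis=1) - mu) <= half_med))  # ~0.95
```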