I think the fundamental problem is that frequentist statistics can only assign a probability to something that can have a long run frequency. Whether the true value of a parameter lies in a particular interval or not doesn't have a long run frequency, because we can only perform the experiment once, so you can't assign a frequentist probability to it. The problem arises from the definition of a probability. If you change the definition of a probability to a Bayesian one, then the problem instantly disappears, as you are no longer tied to discussion of long run frequencies.
See my (rather tongue-in-cheek) answer to a related question here:
"A Frequentist is someone that believes probabilies represent long run frequencies with which events ocurr; if needs be, he will invent a fictitious population from which your particular situation could be considered a random sample so that he can meaningfully talk about long run frequencies. If you ask him a question about a particular situation, he will not give a direct answer, but instead make a statement about this (possibly imaginary) population."
In the case of a confidence interval, the question we normally would like to ask (unless we have a problem in quality control for example) is "given this sample of data, return the smallest interval that contains the true value of the parameter with probability X". However a frequentist can't do this as the experiment is only performed once and so there are no long run frequencies that can be used to assign a probability. So instead the frequentist has to invent a population of experiments (that you didn't perform) from which the experiment you did perform can be considered a random sample. The frequentist then gives you an indirect answer about that fictitious population of experiments, rather than a direct answer to the question you really wanted to ask about a particular experiment.
Essentially it is a problem of language: the frequentist definition of a probability simply doesn't allow discussion of the probability of the true value of a parameter lying in a particular interval. That doesn't mean frequentist statistics are bad, or not useful, but it is important to know the limitations.
Regarding the major update
I am not sure we can say that "Before we calculate a 95% confidence interval, there is a 95% probability that the interval we calculate will cover the true parameter" within a frequentist framework. There is an implicit inference here that the long run frequency with which the true value of the parameter lies in confidence intervals constructed by some particular method is also the probability that the true value of the parameter will lie in the confidence interval for the particular sample of data we are going to use. This is a perfectly reasonable inference, but it is a Bayesian inference, not a frequentist one, as the probability that the true value of the parameter lies in the confidence interval that we construct for a particular sample of data has no long run frequency, as we only have one sample of data. This is exactly the danger of frequentist statistics: common sense reasoning about probability is generally Bayesian, in that it is about the degree of plausibility of a proposition.
We can however "make some sort of non-frequentist argument that we're 95% sure the true parameter will lie in [a,b]", that is exactly what a Bayesian credible interval is, and for many problems the Bayesian credible interval exactly coincides with the frequentist confidence interval.
"I don't want to make this a debate about the philosophy of probability", sadly this is unavoidable, the reason you can't assign a frequentist probability to whether the true value of the statistic lies in the confidence interval is a direct consequence of the frequentist philosophy of probability. Frequentists can only assign probabilities to things that can have long run frequencies, as that is how frequentists define probability in their philosophy. That doesn't make frequentist philosophy wrong, but it is important to understand the bounds imposed by the definition of a probability.
"Before I've entered the password and seen the interval (but after the computer has already calculated it), what's the probability that the interval will contain the true parameter? It's 95%, and this part is not up for debate:" This is incorrect, or at least in making such a statement, you have departed from the framework of frequentist statistics and have made a Bayesian inference involving a degree of plausibility in the truth of a statement, rather than a long run frequency. However, as I have said earlier, it is a perfectly reasonable and natural inference.
Nothing has changed before or after entering the password, because neither event can be assigned a frequentist probability. Frequentist statistics can be rather counter-intuitive, as we often want to ask questions about degrees of plausibility of statements regarding particular events, but this lies outside the remit of frequentist statistics, and this is the origin of most misinterpretations of frequentist procedures.
Best Answer
This answer analyzes the meaning of the quotation and offers the results of a simulation study to illustrate it and help understand what it might be trying to say. The study can easily be extended by anybody (with rudimentary R skills) to explore other confidence interval procedures and other models.

Two interesting issues emerged in this work. One concerns how to evaluate the accuracy of a confidence interval procedure. The impression one gets of robustness depends on that. I display two different accuracy measures so you can compare them.
The other issue is that although a confidence interval procedure with low confidence may be robust, the corresponding confidence limits might not be robust at all. Intervals tend to work well because the errors they make at one end often counterbalance the errors they make at the other. As a practical matter, you can be pretty sure that around half of your $50\%$ confidence intervals are covering their parameters, but the actual parameter might consistently lie near one particular end of each interval, depending on how reality departs from your model assumptions.
Robust has a standard meaning in statistics: roughly, insensitivity to departures from the assumptions surrounding an underlying probabilistic model (Hoaglin, Mosteller, and Tukey, Understanding Robust and Exploratory Data Analysis. J. Wiley (1983), p. 2).
This is consistent with the quotation in the question. To understand the quotation we still need to know the intended purpose of a confidence interval. To this end, let's review what Gelman wrote.
Since getting a sense of predicted values is not what confidence intervals (CIs) are intended for, I will focus on getting a sense of parameter values, which is what CIs do. Let's call these the "target" values. Whence, by definition, a CI is intended to cover its target with a specified probability (its confidence level). Achieving intended coverage rates is the minimum criterion for evaluating the quality of any CI procedure. (Additionally, we might be interested in typical CI widths. To keep the post to a reasonable length, I will ignore this issue.)
These considerations invite us to study how much a confidence interval calculation could mislead us concerning the target parameter value. The quotation could be read as suggesting that lower-confidence CIs might retain their coverage even when the data are generated by a process different than the model. That's something we can test. The procedure is:
1. Adopt a probability model that includes at least one parameter. The classic one is sampling from a Normal distribution of unknown mean and variance.

2. Select a CI procedure for one or more of the model's parameters. An excellent one constructs the CI from the sample mean and sample standard deviation, multiplying the latter by a factor given by a Student t distribution (see the sketch after this list).

3. Apply that procedure to various different models--departing not too much from the adopted one--to assess its coverage over a range of confidence levels.
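For concreteness, here is a minimal sketch of the CI procedure in step 2 (the function name `t.ci` is just an illustrative choice, not from the original code):

```r
# Two-sided t-based confidence interval for the mean of a sample x
t.ci <- function(x, level = 0.95) {
  n <- length(x)
  half <- qt(1 - (1 - level) / 2, df = n - 1) * sd(x) / sqrt(n)
  mean(x) + c(-1, 1) * half
}
t.ci(rnorm(12), level = 0.5)  # e.g. a 50% CI from a Normal sample of size 12
```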
As an example, I have done just that. I have allowed the underlying distribution to vary across a wide range, from almost Bernoulli, to Uniform, to Normal, to Exponential, and all the way to Lognormal. These include symmetric distributions (the first three) and strongly skewed ones (the last two). For each distribution I generated 50,000 samples of size 12. For each sample I constructed two-sided CIs of confidence levels between $50\%$ and $99.8\%$, which covers most applications.
An interesting issue now arises: How should we measure how well (or how badly) a CI procedure is performing? A common method simply evaluates the difference between the actual coverage and the confidence level. This can look suspiciously good for high confidence levels, though. For instance, if you are trying to achieve 99.9% confidence but you get only 99% coverage, the raw difference is a mere 0.9%. However, that means your procedure fails to cover the target ten times more often than it should! For this reason, a more informative way of comparing coverages ought to use something like odds ratios. I use differences of logits, which are the logarithms of odds ratios. Specifically, when the desired confidence level is $\alpha$ and the actual coverage is $p$, then
$$\log\left(\frac{p}{1-p}\right) - \log\left(\frac{\alpha}{1-\alpha}\right)$$
nicely captures the difference. When it is zero, the coverage is exactly the value intended. When it is negative, the coverage is too low--which means the CI is too optimistic and underestimates the uncertainty.
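To see why this measure is more informative, consider the 99.9% versus 99% example above; a quick sketch, with $\alpha$ the intended confidence level and $p$ the actual coverage as just defined:

```r
logit <- function(q) log(q / (1 - q))
alpha <- 0.999   # intended confidence level
p     <- 0.990   # actual coverage
p - alpha                      # raw difference: a mere -0.009
logit(p) - logit(alpha)        # logit difference: about -2.3
exp(logit(alpha) - logit(p))   # odds ratio of non-coverage: about 10, i.e. the
                               # procedure misses roughly ten times as often as intended
```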
The question, then, is how do these error rates vary with confidence level as the underlying model is perturbed? We can answer it by plotting the simulation results. These plots quantify how "unrealistic" the "near-certainty" of a CI might be in this archetypal application.
The graphics show the same results, but the one at the left displays the values on logit scales while the one at the right uses raw scales. The Beta distribution is a Beta$(1/30,1/30)$ (which is practically a Bernoulli distribution). The lognormal distribution is the exponential of the standard Normal distribution. The normal distribution is included to verify that this CI procedure really does attain its intended coverage and to reveal how much variation to expect from the finite simulation size. (Indeed, the graphs for the normal distribution are comfortably close to zero, showing no significant deviations.)
It is clear that on the logit scale, the coverages grow more divergent as the confidence level increases. There are some interesting exceptions, though. If we are unconcerned with perturbations of the model that introduce skewness or long tails, then we can ignore the exponential and lognormal and focus on the rest. Their behavior is erratic until $\alpha$ exceeds $95\%$ or so (a logit of $3$), at which point the divergence has set in.
This little study brings some concreteness to Gelman's claim and illustrates some of the phenomena he might have had in mind. In particular, when we are using a CI procedure with a low confidence level, such as $\alpha=50\%$, then even when the underlying model is strongly perturbed, it looks like the coverage will still be close to $50\%$: our feeling that such a CI will be correct about half the time and incorrect the other half is borne out. That is robust. If instead we are hoping to be right, say, $95\%$ of the time, which means we really want to be wrong only $5\%$ of the time, then we should be prepared for our error rate to be much greater in case the world doesn't work quite the way our model supposes.
Incidentally, this property of $50\%$ CIs holds in large part because we are studying symmetric confidence intervals. For the skewed distributions, the individual confidence limits can be terrible (and not robust at all), but their errors often cancel out. Typically one tail is short and the other long, leading to over-coverage at one end and under-coverage at the other. I believe that $50\%$ confidence limits will not be anywhere near as robust as the corresponding intervals.
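To check that last point, here is a minimal sketch (my own, under the assumptions above: lognormal data, samples of size 12, the t-based procedure) that splits the misses of a 50% CI by which end fails; the two one-sided miss rates typically turn out very unequal, consistent with the cancellation just described:

```r
set.seed(17)
n.sim <- 50000
n <- 12
true.mean <- exp(1 / 2)   # mean of the standard lognormal
miss.low <- miss.high <- 0
for (i in seq_len(n.sim)) {
  x  <- rlnorm(n)
  ci <- mean(x) + qt(c(0.25, 0.75), df = n - 1) * sd(x) / sqrt(n)
  if (true.mean < ci[1]) miss.low  <- miss.low  + 1   # parameter below the lower limit
  if (true.mean > ci[2]) miss.high <- miss.high + 1   # parameter above the upper limit
}
c(coverage   = 1 - (miss.low + miss.high) / n.sim,
  miss.below = miss.low / n.sim,
  miss.above = miss.high / n.sim)
```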
This is the R code that produced the plots. It is readily modified to study other distributions, other ranges of confidence, and other CI procedures.
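The original script is not reproduced here; the following is a minimal sketch along those lines (the distributions, sample size, and number of simulations follow the description above, while the helper names and structure are merely illustrative):

```r
# Estimate coverage of two-sided t-based CIs for the mean under several
# underlying distributions and confidence levels; report logit differences.
set.seed(17)
n.sim       <- 50000     # reduce for a quicker run
n           <- 12
conf.levels <- c(0.50, 0.80, 0.90, 0.95, 0.99)
t.factors   <- qt(1 - (1 - conf.levels) / 2, df = n - 1)

# Each entry: a sampler and the true mean it targets.
distributions <- list(
  beta        = list(r = function(n) rbeta(n, 1/30, 1/30), mean = 1/2),
  uniform     = list(r = function(n) runif(n),             mean = 1/2),
  normal      = list(r = function(n) rnorm(n),             mean = 0),
  exponential = list(r = function(n) rexp(n),              mean = 1),
  lognormal   = list(r = function(n) rlnorm(n),            mean = exp(1/2))
)

logit <- function(q) log(q / (1 - q))

results <- sapply(distributions, function(d) {
  covered <- matrix(FALSE, n.sim, length(conf.levels))
  for (i in seq_len(n.sim)) {
    x <- d$r(n)
    half <- t.factors * sd(x) / sqrt(n)
    covered[i, ] <- abs(mean(x) - d$mean) <= half
  }
  p <- colMeans(covered)           # actual coverage at each confidence level
  logit(p) - logit(conf.levels)    # logit difference: 0 means nominal coverage
})
rownames(results) <- paste0(100 * conf.levels, "%")
round(results, 2)
```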