Part of the issue is that the frequentist definition of a probability doesn't allow a nontrivial probability to be applied to the outcome of a particular experiment, but only to some fictitious population of experiments from which this particular experiment can be considered a sample. The definition of a CI is confusing because it is a statement about this (usually) fictitious population of experiments, rather than about the particular data collected in the instance at hand. So part of the issue is one of the definition of probability: the idea of the true value lying within a particular interval with probability 95% is inconsistent with a frequentist framework.
Another aspect of the issue is that the calculation of the frequentist confidence interval doesn't use all of the information in the particular sample that is relevant to bounding the true value of the statistic. My question "Are there any examples where Bayesian credible intervals are obviously inferior to frequentist confidence intervals" discusses a paper by Edwin Jaynes with some good examples that highlight the difference between confidence intervals and credible intervals. One that is particularly relevant to this discussion is Example 5, which discusses the difference between a credible and a confidence interval for estimating the parameter of a truncated exponential distribution (in a problem of industrial quality control). In the example he gives, there is enough information in the sample to be certain that the true value of the parameter lies nowhere in a properly constructed 90% confidence interval!
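To give a flavour of that example, here is a rough Monte Carlo sketch in Python. The details (three observed lifetimes of 12, 14 and 16, the unbiased estimator $\bar{x} - 1$, and the shortest-interval construction) follow my recollection of Jaynes's Example 5 and are meant as an illustration rather than a reproduction of his exact calculation:

```python
import numpy as np

# Truncated exponential: p(x|theta) = exp(theta - x) for x >= theta, else 0.
# A standard frequentist interval uses the unbiased estimator theta_hat = mean(x) - 1.
# The pivot theta_hat - theta is distributed as Gamma(shape=n, scale=1/n) - 1,
# so the shortest 90% interval for the pivot can be found by Monte Carlo.
rng = np.random.default_rng(0)
n, N = 3, 500_000
pivot = np.sort(rng.gamma(shape=n, scale=1.0 / n, size=N) - 1.0)
k = int(0.90 * N)
i = np.argmin(pivot[k:] - pivot[:-k])          # narrowest window containing 90% of draws
ci_pivot = (pivot[i], pivot[i + k])

x = np.array([12.0, 14.0, 16.0])               # observed lifetimes (illustrative)
theta_hat = x.mean() - 1.0
ci = (theta_hat - ci_pivot[1], theta_hat - ci_pivot[0])

# The likelihood is zero unless theta <= min(x), so the data rule out theta > 12.
# With a flat prior the posterior is n * exp(n * (theta - min(x))) for theta <= min(x),
# which gives the following 90% highest-posterior-density credible interval:
cred = (x.min() + np.log(0.10) / n, x.min())

print("90% confidence interval:", ci)          # roughly (12.15, 13.83): entirely above 12
print("90% credible interval:  ", cred)        # roughly (11.23, 12.00)
```

The sample itself tells us that $\theta$ cannot exceed the smallest observation, yet the perfectly valid 90% confidence interval lies entirely above it.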
This may seem shocking to some, but the reason for this result is that confidence intervals and credible intervals are answers to two different questions, from two different interpretations of probability.
The confidence interval is the answer to the request: "Give me an interval that will bracket the true value of the parameter in $100p$% of the instances of an experiment that is repeated a large number of times." The credible interval is an answer to the request: "Give me an interval that brackets the true value with probability $p$ given the particular sample I've actually observed." To be able to answer the latter request, we must first adopt either (a) a new concept of the data generating process or (b) a different concept of the definition of probability itself.
The main reason that any particular 95% confidence interval does not imply a 95% chance of containing the mean is that the confidence interval is an answer to a different question; it only gives the right answer to our question when the two questions happen to have the same numerical answer.
In short, credible and confidence intervals answer different questions from different perspectives; both are useful, but you need to choose the right interval for the question you actually want to ask. If you want an interval that admits an interpretation of a 95% (posterior) probability of containing the true value, then choose a credible interval (and, with it, the attendant conceptualization of probability), not a confidence interval. The thing you ought not to do is to adopt a different definition of probability in the interpretation than that used in the analysis.
Thanks to @cardinal for his refinements!
Here is a concrete example, from David MacKay's excellent book "Information Theory, Inference, and Learning Algorithms" (page 464):
Let the parameter of interest be $\theta$ and the data $D$, a pair of points $x_1$ and $x_2$ drawn independently from the following distribution:
$p(x|\theta) = \left\{\begin{array}{cl} 1/2 & x = \theta,\\1/2 & x = \theta + 1, \\ 0 & \mathrm{otherwise}\end{array}\right.$
If $\theta$ is $39$, then we would expect to see the datasets $(39,39)$, $(39,40)$, $(40,39)$ and $(40,40)$ all with equal probability $1/4$. Consider the confidence interval
$[\theta_\mathrm{min}(D),\theta_\mathrm{max}(D)] = [\mathrm{min}(x_1,x_2), \mathrm{max}(x_1,x_2)]$.
Clearly this is a valid 75% confidence interval: if you re-sampled the data, $D = (x_1,x_2)$, many times, the interval would fail to contain $\theta$ only when both observations equal $\theta+1$, which happens with probability $1/4$, so the interval constructed in this way would contain the true value 75% of the time.
Now consider the data $D = (29,29)$. In this case the frequentist 75% confidence interval would be $[29, 29]$. However, assuming the model of the generating process is correct, $\theta$ could be 28 or 29 in this case, and we have no reason to suppose that 29 is more likely than 28, so the posterior probability is $p(\theta=28|D) = p(\theta=29|D) = 1/2$. So in this case the frequentist confidence interval is clearly not a 75% credible interval as there is only a 50% probability that it contains the true value of $\theta$, given what we can infer about $\theta$ from this particular sample.
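Both statements are easy to check by simulation. Here is a short Python sketch (the true value of 39 and the number of repetitions are of course just illustrative): it confirms the 75% coverage over repeated experiments, and also that, among the repetitions that happen to produce a sample with both points equal (like the $(29,29)$ case above), the interval contains $\theta$ only half the time.

```python
import numpy as np

rng = np.random.default_rng(1)
theta = 39                                          # the true parameter (unknown in practice)
x = theta + rng.integers(0, 2, size=(100_000, 2))   # each observation is theta or theta + 1
lo, hi = x.min(axis=1), x.max(axis=1)               # the interval [min(x1,x2), max(x1,x2)]
covered = (lo <= theta) & (theta <= hi)
same = x[:, 0] == x[:, 1]

print("coverage over all repetitions:  ", covered.mean())         # ~0.75
print("coverage when the points differ:", covered[~same].mean())  # 1.0
print("coverage when the points agree: ", covered[same].mean())   # ~0.50
```

Conditioning on the relevant feature of the sample we actually observed (both points equal) is exactly the step that the frequentist coverage statement does not take.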
Yes, this is a contrived example, but if confidence intervals and credible intervals were not different, then they would still be identical in contrived examples.
Note the key difference is that the confidence interval is a statement about what would happen if you repeated the experiment many times, the credible interval is a statement about what can be inferred from this particular sample.
There is a reasonably well-developed physics/statistics literature on this topic
Some good papers on this topic are Zhang-Yuan and Bin (1985), Keller (1986), Vulović and Prange (1986), Gelman and Nolan (2002), Mizuguchi and Suwashita (2006), Diaconis, Holmes and Montgomery (2007) and Strzałko, Grabski, Stefański, Perlikowski and Kapitaniak (2008). The general approach in this literature is to develop physics models of the coin-tossing process (under idealised conditions) and examine the sensitivity of the outcome to initial conditions such as the coin's velocity, angular velocity, and so on. A coin-toss is generally highly sensitive to these initial conditions, which means that even slight random variation in them leads to an outcome probability extremely close to "fair". The literature involves a combination of physics models and empirical tests, some done with mechanical coin-flipping mechanisms and some done by humans.
The details of the results in the literature are somewhat complicated, but the general theme that emerges is that the mechanics of a coin-toss is so sensitive to small changes in initial conditions that it is quite "chaotic", and so the uniformity assumption (i.e., equal chances of heads/tails) is reasonable in practice. In particular, if the coin is allowed to bounce after the flip, before coming to rest, this introduces extreme sensitivity to the initial conditions. Diaconis, Holmes and Montgomery (2007) show that you can get results that depart a bit from uniformity under extremely idealised conditions (no air resistance, no bouncing of the coin, high consistency of initial angular velocity, etc.), but this is not reflective of most practical cases of coin-flips by humans. Strzałko et al (2008, pp. 90-91) reach a similar conclusion.
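As a rough illustration of the sensitivity discussed above, here is a toy Python sketch of a Keller-style model of the toss: the coin goes up with speed $u$, spins about a horizontal axis at angular velocity $\omega$, and is caught at the launch height with no bouncing. The launch speed, the spin rates and the 5% spread in the initial conditions are illustrative numbers of my own choosing, not values taken from the papers cited above.

```python
import numpy as np

G = 9.81                                   # gravitational acceleration, m/s^2
rng = np.random.default_rng(2)

def prob_same_face(u_mean, omega_mean, rel_sd=0.05, n=200_000):
    """Probability that the caught coin shows the same face it started with,
    given small random variation in the launch speed and spin rate."""
    u = rng.normal(u_mean, rel_sd * u_mean, n)                # launch speed, m/s
    omega = rng.normal(omega_mean, rel_sd * omega_mean, n)    # spin rate, rad/s
    flight_time = 2.0 * u / G                                 # time to fall back to launch height
    angle = (omega * flight_time) % (2.0 * np.pi)             # total rotation, mod one full turn
    return np.mean(np.cos(angle) > 0.0)                       # starting face still points upward

for omega in [5.0, 20.0, 80.0, 320.0]:     # rad/s: from a lazy flip to a vigorous one
    print(f"omega = {omega:6.1f} rad/s   P(same face) = {prob_same_face(2.5, omega):.3f}")
```

With a lazy flip the outcome is essentially determined by the initial conditions, while a vigorous spin washes that dependence out and drives the probability towards $1/2$, which is the broad theme of the results summarised above.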
If you have a good look at the literature on this topic you will see that you can obtain a coin toss that is extremely close to "fair" by tossing with a sufficiently large angular velocity to ensure that the coin spins a reasonable number of times before landing. You can also get closer to fairness by allowing the coin to bounce on a hard surface instead of catching it. And of course, if you pick up the coin for your flip without paying any attention to the initial face-up side, then the initial condition is arguably already uniform, which would imply uniformity of the outcome. Good luck with improving your coin-tossing technique!