Update: With the benefit of a few years' hindsight, I've penned a more concise treatment of essentially the same material in response to a similar question.
How to Construct a Confidence Region
Let us begin with a general method for constructing confidence regions. It can be applied to a single parameter, to yield a confidence interval or set of intervals; and it can be applied to two or more parameters, to yield higher-dimensional confidence regions.
We assert that the observed statistics $D$ originate from a distribution with parameters $\theta$, namely the sampling distribution $s(d|\theta)$ over possible statistics $d$, and seek a confidence region for $\theta$ in the set of possible values $\Theta$. Define a Highest Density Region (HDR): the $h$-HDR of a PDF is the smallest subset of its domain that supports probability $h$. Denote the $h$-HDR of $s(d|\psi)$ as $H_\psi$, for any $\psi \in \Theta$. Then, the $h$ confidence region for $\theta$, given data $D$, is the set $C_D = \{ \phi : D \in H_\phi \}$. A typical value of $h$ would be 0.95.
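To make the recipe concrete, here is a minimal numerical sketch. It assumes, purely for illustration, a normal sampling distribution with known scale; for a symmetric unimodal density the $h$-HDR is just the central interval, and $C_D$ is found by scanning candidate values of $\phi$. All numbers are made up.

```python
import numpy as np
from scipy import stats

# Illustrative assumptions (not part of the definition): s(d | psi) is
# normal with known standard error sigma_n, and h = 0.95.
sigma_n, h, D = 1.0, 0.95, 2.3

# For a symmetric unimodal density the h-HDR of s(. | psi) is the
# central interval [psi - z*sigma_n, psi + z*sigma_n].
z = stats.norm.ppf((1 + h) / 2)

# C_D = {phi : D in H_phi}, found by scanning a grid of candidates.
phi = np.linspace(D - 5, D + 5, 10001)
in_region = np.abs(D - phi) <= z * sigma_n
print(phi[in_region].min(), phi[in_region].max())  # ~ D -/+ z*sigma_n
```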
A Frequentist Interpretation
From the preceding definition of a confidence region follows
$$
d \in H_\psi \longleftrightarrow \psi \in C_d
$$
with $C_d = \{ \phi : d \in H_\phi \}$. Now imagine a large set of (imaginary) observations $\{D_i\}$, taken under similar circumstances to $D$; that is, they are samples from $s(d|\theta)$. Since $H_\theta$ supports probability mass $h$ of the PDF $s(d|\theta)$, $P(D_i \in H_\theta) = h$ for all $i$. Therefore, in the long run, the fraction of $\{D_i\}$ for which $D_i \in H_\theta$ is $h$. And so, by the equivalence above, the fraction of $\{D_i\}$ for which $\theta \in C_{D_i}$ is also $h$.
This, then, is what the frequentist claim for the $h$ confidence region for $\theta$ amounts to:
Take a large number of imaginary observations $\{D_i\}$ from the sampling distribution $s(d|\theta)$ that gave rise to the observed statistics $D$. Then, $\theta$ lies within a fraction $h$ of the analogous but imaginary confidence regions $\{C_{D_i}\}$.
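That long-run claim is easy to check by simulation. A sketch under the same assumed normal setup as above (all values illustrative):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
theta, sigma_n, h = 0.0, 1.0, 0.95          # assumed true parameter and scale
z = stats.norm.ppf((1 + h) / 2)

# Imaginary replications D_i ~ s(d | theta).  By the equivalence
# d in H_psi <-> psi in C_d, theta lies in C_{D_i} exactly when
# D_i lies in H_theta.
D_i = rng.normal(theta, sigma_n, size=100_000)
covered = np.abs(D_i - theta) <= z * sigma_n
print(covered.mean())                        # ~ 0.95 = h
```

Note that everything here is a statement about the ensemble $\{D_i\}$, not about $\theta$ itself.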
The confidence region $C_D$ therefore does not make any claim about the probability that $\theta$ lies somewhere! The reason is simply that there is nothing in the fomulation that allows us to speak of a probability distribution over $\theta$. The interpretation is just elaborate superstructure, which does not improve the base. The base is only $s(d | \theta)$ and $D$, where $\theta$ does not appear as a distributed quantity, and there is no information we can use to address that. There are basically two ways to get a distribution over $\theta$:
- Assign a distribution directly from the information at hand: $p(\theta | I)$.
- Relate $\theta$ to another distributed quantity: $p(\theta | I) = \int p(\theta x | I) dx = \int p(\theta | x I) p(x | I) dx$.
In both cases, $\theta$ must appear on the left somewhere. Frequentists cannot use either method, because they both require a heretical prior.
A Bayesian View
The most a Bayesian can make of the $h$ confidence region $C_D$, given without qualification, is simply the direct interpretation: that it is the set of $\phi$ for which $D$ falls in the $h$-HDR $H_\phi$ of the sampling distribution $s(d|\phi)$. It does not necessarily tell us much about $\theta$, and here's why.
The probability that $\theta \in C_D$, given $D$ and the background information $I$, is:
\begin{align*}
P(\theta \in C_D | DI) &= \int_{C_D} p(\theta | DI) d\theta \\
&= \int_{C_D} \frac{p(D | \theta I) p(\theta | I)}{p(D | I)} d\theta
\end{align*}
Notice that, unlike the frequentist interpretation, we have immediately demanded a distribution over $\theta$. The background information $I$ tells us, as before, that the sampling distribution is $s(d | \theta)$:
\begin{align*}
P(\theta \in C_D | DI) &= \int_{C_D} \frac{s(D | \theta) p(\theta | I)}{p(D | I)} d \theta \\
&= \frac{\int_{C_D} s(D | \theta) p(\theta | I) d\theta}{p(D | I)} \\
\text{i.e.} \quad\quad P(\theta \in C_D | DI) &= \frac{\int_{C_D} s(D | \theta) p(\theta | I) d\theta}{\int s(D | \theta) p(\theta | I) d\theta}
\end{align*}
Now this expression does not in general evaluate to $h$, which is to say, the $h$ confidence region $C_D$ does not always contain $\theta$ with probability $h$. In fact it can be starkly different from $h$. There are, however, many common situations in which it does evaluate to $h$, which is why confidence regions are often consistent with our probabilistic intuitions.
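Here is a numerical sketch of how stark the mismatch can be, assuming (for illustration only) a sharply informative normal prior; the grid integration below simply evaluates the last expression directly.

```python
import numpy as np
from scipy import stats

sigma_n, h, D = 1.0, 0.95, 2.3
z = stats.norm.ppf((1 + h) / 2)

theta = np.linspace(-10, 10, 200001)
dtheta = theta[1] - theta[0]

lik = stats.norm.pdf(D, loc=theta, scale=sigma_n)   # s(D | theta)
prior = stats.norm.pdf(theta, scale=0.5)            # informative prior (assumed)
post = lik * prior
post /= post.sum() * dtheta                         # normalise: p(theta | D I)

in_C = np.abs(D - theta) <= z * sigma_n             # theta in C_D
print(post[in_C].sum() * dtheta)                    # ~ 0.60, far from h = 0.95
```

The symmetric case treated next shows how the common agreement with $h$ arises.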
For example, suppose that the prior joint PDF of $d$ and $\theta$ is symmetric in that $p_{d,\theta}(d,\theta | I) = p_{d,\theta}(\theta,d | I)$. (Clearly this involves an assumption that the PDF ranges over the same domain in $d$ and $\theta$.) Then, if the prior is $p(\theta | I) = f(\theta)$, we have $s(D | \theta) p(\theta | I) = s(D | \theta) f(\theta) = s(\theta | D) f(D)$. Hence
\begin{align*}
P(\theta \in C_D | DI) &= \frac{\int_{C_D} s(\theta | D) d\theta}{\int s(\theta | D) d\theta} \\
\text{i.e.} \quad\quad P(\theta \in C_D | DI) &= \int_{C_D} s(\theta | D) d\theta
\end{align*}
From the definition of an HDR we know that for any $\psi \in \Theta$
\begin{align*}
\int_{H_\psi} s(d | \psi) dd &= h \\
\text{and therefore that} \quad\quad \int_{H_D} s(d | D) dd &= h \\
\text{or equivalently} \quad\quad \int_{H_D} s(\theta | D) d\theta &= h
\end{align*}
Therefore, given that $s(d | \theta) f(\theta) = s(\theta | d) f(d)$, $C_D = H_D$ implies $P(\theta \in C_D | DI) = h$. The antecedent satisfies
$$
C_D = H_D \longleftrightarrow \forall \psi \; [ \psi \in C_D \leftrightarrow \psi \in H_D ]
$$
Applying the equivalence near the top:
$$
C_D = H_D \longleftrightarrow \forall \psi \; [ D \in H_\psi \leftrightarrow \psi \in H_D ]
$$
Thus, the confidence region $C_D$ contains $\theta$ with probability $h$ if for all possible values $\psi$ of $\theta$, the $h$-HDR of $s(d | \psi)$ contains $D$ if and only if the $h$-HDR of $s(d | D)$ contains $\psi$.
Now the symmetric relation $D \in H_\psi \leftrightarrow \psi \in H_D$ is satisfied for all $\psi$ when $s(\psi + \delta | \psi) = s(D - \delta | D)$ for all $\delta$ that span the support of $s(d | D)$ and $s(d | \psi)$. We can therefore form the following argument:
- $s(d | \theta) f(\theta) = s(\theta | d) f(d)$ (premise)
- $\forall \psi \; \forall \delta \; [ s(\psi + \delta | \psi) = s(D - \delta | D) ]$ (premise)
- $\forall \psi \; \forall \delta \; [ s(\psi + \delta | \psi) = s(D - \delta | D) ] \longrightarrow \forall \psi \; [ D \in H_\psi \leftrightarrow \psi \in H_D ]$
- $\therefore \quad \forall \psi \; [ D \in H_\psi \leftrightarrow \psi \in H_D ]$
- $\forall \psi \; [ D \in H_\psi \leftrightarrow \psi \in H_D ] \longrightarrow C_D = H_D$
- $\therefore \quad C_D = H_D$
- $[s(d | \theta) f(\theta) = s(\theta | d) f(d) \wedge C_D = H_D] \longrightarrow P(\theta \in C_D | DI) = h$
- $\therefore \quad P(\theta \in C_D | DI) = h$
Let's apply the argument to a confidence interval on the mean of a 1-D normal distribution $(\mu, \sigma)$, given a sample mean $\bar{x}$ from $n$ measurements. We have $\theta = \mu$ and $d = \bar{x}$, so that the sampling distribution is
$$
s(d | \theta) = \frac{\sqrt{n}}{\sigma \sqrt{2 \pi}} e^{-\frac{n}{2 \sigma^2} { \left( d - \theta \right) }^2 }
$$
Suppose also that we know nothing about $\theta$ before taking the data (except that it's a location parameter) and therefore assign a uniform prior: $f(\theta) = k$. Clearly we now have $s(d | \theta) f(\theta) = s(\theta | d) f(d)$, so the first premise is satisfied. Let $s(d | \theta) = g\left( (d - \theta)^2 \right)$; that is, it can be written in that form. Then
\begin{gather*}
s(\psi + \delta | \psi) = g \left( (\psi + \delta - \psi)^2 \right) = g(\delta^2) \\
\text{and} \quad\quad s(D - \delta | D) = g \left( (D - \delta - D)^2 \right) = g(\delta^2) \\
\text{so that} \quad\quad \forall \psi \; \forall \delta \; [s(\psi + \delta | \psi) = s(D - \delta | D)]
\end{gather*}
whereupon the second premise is satisfied. Both premises being true, the eight-point argument leads us to conclude that the probability that $\theta$ lies in the confidence interval $C_D$ is $h$!
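A quick numerical check of this conclusion (a sketch only; the sample size, $\sigma$, and observed mean are made up):

```python
import numpy as np
from scipy import stats

sigma, n, h = 2.0, 25, 0.95
sigma_n = sigma / np.sqrt(n)                 # scale of the sampling distribution
z = stats.norm.ppf((1 + h) / 2)
D = 2.3                                      # observed sample mean (made up)

theta = np.linspace(D - 10, D + 10, 200001)
dtheta = theta[1] - theta[0]

# Uniform prior f(theta) = k, so the posterior is proportional to s(D | theta).
post = stats.norm.pdf(D, loc=theta, scale=sigma_n)
post /= post.sum() * dtheta

in_C = np.abs(D - theta) <= z * sigma_n      # the h confidence interval C_D
print(post[in_C].sum() * dtheta)             # ~ 0.95 = h, as the argument promises
```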
We therefore have an amusing irony:
- The frequentist who assigns the $h$ confidence interval cannot say that $P(\theta \in C_D) = h$, no matter how innocently uniform $\theta$ looks before incorporating the data.
- The Bayesian who would not assign an $h$ confidence interval in that way knows anyhow that $P(\theta \in C_D | DI) = h$.
Final Remarks
We have identified conditions (i.e. the two premises) under which the $h$ confidence region does indeed yield probability $h$ that $\theta \in C_D$. A frequentist will baulk at the first premise, because it involves a prior on $\theta$, and this sort of deal-breaker is inescapable on the route to a probability. But for a Bayesian, it is acceptable---nay, essential. These conditions are sufficient but not necessary, so there are many other circumstances under which the Bayesian $P(\theta \in C_D | DI)$ equals $h$. Equally though, there are many circumstances in which $P(\theta \in C_D | DI) \ne h$, especially when the prior information is significant.
We have applied a Bayesian analysis just as a consistent Bayesian would, given the information at hand, including statistics $D$. But a Bayesian, if he possibly can, will apply his methods to the raw measurements instead---to the $\{x_i\}$, rather than $\bar{x}$. Oftentimes, collapsing the raw data into summary statistics $D$ destroys information in the data; and then the summary statistics are incapable of speaking as eloquently as the original data about the parameters $\theta$.
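As a toy illustration of that information loss (under an assumed uniform model, not one discussed above): for samples from $U(0, \theta)$ the raw-data likelihood depends on $\max_i x_i$, which the sample mean alone cannot recover.

```python
import numpy as np

# Two made-up samples from a Uniform(0, theta) model with the same mean
# but different maxima: the mean-only summary cannot tell them apart,
# while the raw-data likelihood can.
x1 = np.array([0.2, 0.5, 0.8])               # mean 0.5, max 0.8
x2 = np.array([0.4, 0.5, 0.6])               # mean 0.5, max 0.6

def likelihood(theta, x):
    """Uniform(0, theta) likelihood of the raw sample x."""
    return np.where(theta >= x.max(), theta ** -float(len(x)), 0.0)

theta = np.array([0.7, 0.9, 1.2])
print(likelihood(theta, x1))                 # zero at theta = 0.7: ruled out
print(likelihood(theta, x2))                 # positive at theta = 0.7
```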
The Neyman-Pearson theory of Null Hypothesis Significance Testing has the goal of providing you with a decision rule which, when the null hypothesis is true, allows you to make the correct choice $95\%$ of the time. However, it cannot tell you whether the confidence interval you computed from your random sample (i.e., your realization $i=(-11.052802,\,-4.947198)$ of the random interval $I$) contains the true parameter: you just don't know.
Then why do you reject the null hypothesis $H_0$ in this specific case? You do it because you know that, if $H_0$ were true and you repeated this experiment a large number of times, then, following the Neyman-Pearson decision rule, which is:
- accept $H_0$ if $i$ contains 0
- reject $H_0$ if $i$ doesn't contain 0 (as in your case)
you would be wrong only $5\%$ of the time. The decision rule is thus a guide for controlling your error rate in the long run.
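A simulation of that long-run guarantee, with a made-up sample size and standard deviation (the rule and the $5\%$ rate are the point, not the numbers):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, sigma, alpha = 16, 8.0, 0.05              # made-up design; known sigma assumed
z = stats.norm.ppf(1 - alpha / 2)

# H_0 true: mu = 0.  Repeat the experiment many times and apply the
# rule "reject H_0 iff the interval i does not contain 0".
reps = 100_000
xbar = rng.normal(0.0, sigma / np.sqrt(n), size=reps)
half = z * sigma / np.sqrt(n)
rejected = (xbar - half > 0) | (xbar + half < 0)
print(rejected.mean())                       # ~ 0.05: wrong 5% of the time
```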
This is very relevant in manufacturing, for example in process quality control. If the manufacturing process is in control, you are effectively sampling repeatedly from the same population, so you expect $5\%$ of your confidence intervals not to contain the parameter of interest. A process in control would therefore raise an alarm $5\%$ of the time, which can sound odd (in practice, the $\alpha$ level used in quality control is usually much smaller than $5\%$).
You astutely asked in a comment why not compute multiple confidence intervals instead. First of all, in real life you often can't afford the luxury of repeated sampling, because of time, budget, and other constraints. Secondly, even if you could, it wouldn't make sense to create many such intervals and try to "intersect" them: there's no principled way to do that. Instead, you can gather all your $m$ random samples of size $n$ together and build a confidence interval from the aggregated sample $\mathbf{x}=(x_{11},\dots,x_{1n},\dots,x_{m1},\dots,x_{mn})$. Since the width of a confidence interval decreases with the sample size $N$ (usually as $O(\frac{1}{\sqrt{N}})$), the resulting confidence interval will be your most accurate inference (but you still won't be able to know with certainty whether it contains the true parameter or not).
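A sketch of that last point, assuming a normal model with known $\sigma$ (all numbers illustrative): pooling $m$ samples of size $n$ shrinks the interval roughly like $1/\sqrt{N}$.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
sigma, z = 8.0, stats.norm.ppf(0.975)        # 95% interval, known sigma (assumed)
n = 25                                       # size of each individual sample

for m in (1, 4, 16, 64):                     # number of pooled samples
    N = n * m
    x = rng.normal(0.0, sigma, size=N)       # the aggregated sample
    half = z * sigma / np.sqrt(N)            # half-width ~ O(1/sqrt(N))
    print(f"N={N:5d}  interval=({x.mean()-half:+.3f}, {x.mean()+half:+.3f})")
```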
Best Answer
I found this thought experiment helpful when thinking about confidence intervals. It also answers your question 3.
Let $X\sim U(0,1)$ and $Y=X+a-\frac{1}{2}$. Consider two observations of $Y$ taking the values $y_1$ and $y_2$ corresponding to observations $x_1$ and $x_2$ of $X$, and let $y_l=\min(y_1,y_2)$ and $y_u=\max(y_1,y_2)$. Then $[y_l,y_u]$ is a 50% confidence interval for $a$ (since the interval includes $a$ if $x_1<\frac12<x_2$ or $x_1>\frac12>x_2$, each of which has probability $\frac14$).
However, if $y_u-y_l>\frac12$ then we know that the probability that the interval contains $a$ is $1$, not $\frac12$. The subtlety is that a $z\%$ confidence interval for a parameter means that the endpoints of the interval (which are random variables) lie either side of the parameter with probability $z\%$ before you calculate the interval; it does not mean that, after you have calculated the interval, the parameter lies within it with probability $z\%$.
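A quick simulation of the thought experiment (the true $a$ below is arbitrary): the unconditional coverage is $50\%$, yet conditional on observing $y_u - y_l > \frac12$ the interval contains $a$ every time.

```python
import numpy as np

rng = np.random.default_rng(0)
a = 3.7                                      # true parameter (arbitrary)

# Y = X + a - 1/2 with X ~ U(0,1), i.e. Y ~ U(a - 1/2, a + 1/2).
reps = 200_000
y = rng.uniform(a - 0.5, a + 0.5, size=(reps, 2))
y_l, y_u = y.min(axis=1), y.max(axis=1)

covers = (y_l <= a) & (a <= y_u)
wide = (y_u - y_l) > 0.5
print(covers.mean())                         # ~ 0.50: the 50% coverage claim
print(covers[wide].mean())                   # = 1.0 once we see a wide interval
```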