The events in statements 2 and 3 are obviously equivalent – I interpret them as $CI_D \ni \theta$ and $\theta \in CI_D$ respectively. The issue here is that you are vague about whether you are talking about CIs as random intervals or as fixed intervals after the observed data has been substituted, and you are also vague about whether you are talking about conditional or unconditional probability. Below I will show which mathematical statements about confidence intervals are true/false. So long as you describe these statements correctly in a textual sense (which requires more explicit specification of some issues you're glossing over) you should be fine.
Probabilistic properties of the CI: I'll conduct a purely probabilistic analysis of confidence intervals as mathematical objects, examining probability statements about these objects both conditionally and unconditionally on $\theta$. Note that in the classical framework, the parameter is treated as an "unknown constant" so we (implicitly) condition on it in all probability statements in that context. Nevertheless, I'll look at things more broadly so that you can see which probabilistic statements are true/false within a generalised framework where you examine the CI on a purely mathematical basis.
In order to show you which statements about confidence intervals are true/false, we will use more detailed notation. Let $\text{CI}_\theta(\mathbf{X}, \alpha)$ denote the $1-\alpha$ level confidence interval for $\theta \in \Theta$ using (random) data vector $\mathbf{X}$. This object is a mapping $\text{CI}_\theta: \mathbb{R}^n \times [0,1] \rightarrow \mathfrak{p}(\mathbb{R})$ that maps an input data vector and significance level to a measurable subset of the real numbers. (For a confidence interval the output of the function is a single connected interval, but you can generalise to confidence sets if you want to remove this restriction.) As I've noted in several other answers (some for questions you link to), an exact confidence interval is defined by the following property:
$$\mathbb{P}(\theta \in \text{CI}_\theta(\mathbf{X}, \alpha) | \theta)
= 1-\alpha \quad \quad \quad \quad \text{for all } \theta \in \Theta.$$
(An approximate confidence interval is one where there is approximate equality, usually relying on asymptotic distributional results.) Substituting the observed data $\mathbf{X}=\mathbf{x}$ then gives the (fixed) confidence interval $\text{CI}_\theta(\mathbf{x}, \alpha)$. To allow us to assess statements about "repeated experiments" we will let $\mathbf{X}_1, \mathbf{X}_2, \mathbf{X}_3, ...$ denote a sequence of IID random vectors, each with the same distribution as $\mathbf{X}$.
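The defining coverage property can be illustrated by simulation. The sketch below (my own illustration, with assumed values $\sigma = 2$, $n = 25$, $\alpha = 0.05$) constructs the exact interval $\bar{x} \pm z_{\alpha/2} \, \sigma / \sqrt{n}$ for a normal mean with known $\sigma$, and checks that the coverage fraction is close to $1-\alpha$ for several values of $\theta$, since the property must hold for every $\theta \in \Theta$:

```python
import math
import random
from statistics import NormalDist

random.seed(0)

def ci_mean(x, sigma, alpha):
    """Exact 1 - alpha confidence interval for a normal mean with known sigma."""
    n = len(x)
    z = NormalDist().inv_cdf(1 - alpha / 2)
    xbar = sum(x) / n
    half = z * sigma / math.sqrt(n)
    return (xbar - half, xbar + half)

# The defining property must hold for every theta, so check the Monte Carlo
# coverage fraction at several different values of theta.
sigma, n, alpha, trials = 2.0, 25, 0.05, 20000
coverage = {}
for theta in (-3.0, 0.0, 7.5):
    hits = 0
    for _ in range(trials):
        x = [random.gauss(theta, sigma) for _ in range(n)]
        lo, hi = ci_mean(x, sigma, alpha)
        hits += (lo <= theta <= hi)
    coverage[theta] = hits / trials
print(coverage)   # each fraction should be close to 1 - alpha = 0.95
```

The same check works for any exact interval; only `ci_mean` changes.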
So, assuming you are using an exact confidence interval, the following statements are true/false$^\dagger$:
$$\begin{align}
\mathbb{P}(\theta \in \text{CI}_\theta(\mathbf{X}, \alpha) | \theta)
&= 1-\alpha \quad \quad \quad \quad \quad \quad \quad \quad \text{True} \\[12pt]
\mathbb{P}(\text{CI}_\theta(\mathbf{X}, \alpha) \ni \theta | \theta)
&= 1-\alpha \quad \quad \quad \quad \quad \quad \quad \quad \text{True} \\[12pt]
\mathbb{P}(\theta \in \text{CI}_\theta(\mathbf{X}, \alpha))
&= 1-\alpha \quad \quad \quad \quad \quad \quad \quad \quad \text{True} \\[12pt]
\mathbb{P}(\text{CI}_\theta(\mathbf{X}, \alpha) \ni \theta)
&= 1-\alpha \quad \quad \quad \quad \quad \quad \quad \quad \text{True} \\[12pt]
-------------&---------------- \\[6pt]
\mathbb{P}(\theta \in \text{CI}_\theta(\mathbf{x}, \alpha) | \theta)
&= 1-\alpha \quad \quad \quad \quad \quad \quad \quad \quad \text{False}^\dagger \\[12pt]
\mathbb{P}(\text{CI}_\theta(\mathbf{x}, \alpha) \ni \theta | \theta)
&= 1-\alpha \quad \quad \quad \quad \quad \quad \quad \quad \text{False}^\dagger \\[12pt]
\mathbb{P}(\theta \in \text{CI}_\theta(\mathbf{x}, \alpha))
&= 1-\alpha \quad \quad \quad \quad \quad \quad \quad \quad \text{False}^\dagger \\[12pt]
\mathbb{P}(\text{CI}_\theta(\mathbf{x}, \alpha) \ni \theta)
&= 1-\alpha \quad \quad \quad \quad \quad \quad \quad \quad \text{False}^\dagger \\[12pt]
-------------&---------------- \\[6pt]
\mathbb{P} \bigg( \lim_{k \rightarrow \infty} \frac{1}{k} \sum_{i=1}^k \mathbb{I}(\theta \in \text{CI}_\theta(\mathbf{X}_i, \alpha))
&= 1-\alpha \bigg| \theta \bigg) = 1 \quad \quad \quad \quad \ \ \text{True} \\[6pt]
\mathbb{P} \bigg( \lim_{k \rightarrow \infty} \frac{1}{k} \sum_{i=1}^k \mathbb{I}(\theta \in \text{CI}_\theta(\mathbf{X}_i, \alpha))
&= 1-\alpha \bigg) = 1 \quad \quad \quad \quad \quad \ \text{True} \\[6pt]
\end{align}$$
If you are working in the classical ("frequentist") context, you can ignore the marginal probability statements here and focus entirely on the conditional probability statements. (In that context the parameter is an "unknown constant" and so all our probabilistic analysis implicitly conditions on it having a fixed value.) As you can see, the remaining distinction that determines whether the statement is true/false is whether you are talking about the "data" in its random sense or fixed sense. You also need to take care to state these mathematical conditions clearly and accurately.
$^\dagger$ Statements listed as $\text{False}$ are statements that are not true in general. These statements may be true "coincidentally" for some specific values of the inputs.
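The dichotomy in the list above can be demonstrated directly. In the sketch below (my own illustration, with assumed values $\theta = 5$, $\sigma = 2$, $n = 25$), the long-run fraction of random intervals $\text{CI}_\theta(\mathbf{X}_i, \alpha)$ covering $\theta$ approaches $1-\alpha$, while a single fixed interval $\text{CI}_\theta(\mathbf{x}, \alpha)$ yields a coverage indicator of exactly 0 or 1:

```python
import math
import random
from statistics import NormalDist

random.seed(1)
theta, sigma, n, alpha = 5.0, 2.0, 25, 0.05
z = NormalDist().inv_cdf(1 - alpha / 2)

def ci(x):
    xbar = sum(x) / len(x)
    half = z * sigma / math.sqrt(len(x))
    return (xbar - half, xbar + half)

# Repeated experiments X_1, ..., X_k: the running fraction of intervals
# covering theta converges to 1 - alpha (the last two statements above).
k = 10000
count = 0
for _ in range(k):
    lo_i, hi_i = ci([random.gauss(theta, sigma) for _ in range(n)])
    count += (lo_i <= theta <= hi_i)
long_run_fraction = count / k
print(long_run_fraction)        # close to 0.95

# A single observed dataset x gives a *fixed* interval: the coverage
# indicator is 0 or 1, never 0.95 (hence the "False" statements above).
x = [random.gauss(theta, sigma) for _ in range(n)]
lo, hi = ci(x)
print(int(lo <= theta <= hi))   # prints 0 or 1, nothing in between
```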
Best Answer
Update: With the benefit of a few years' hindsight, I've penned a more concise treatment of essentially the same material in response to a similar question.
How to Construct a Confidence Region
Let us begin with a general method for constructing confidence regions. It can be applied to a single parameter, to yield a confidence interval or set of intervals; and it can be applied to two or more parameters, to yield higher dimensional confidence regions.
We assert that the observed statistics $D$ originate from a distribution with parameters $\theta$, namely the sampling distribution $s(d|\theta)$ over possible statistics $d$, and seek a confidence region for $\theta$ in the set of possible values $\Theta$. Define a Highest Density Region (HDR): the $h$-HDR of a PDF is the smallest subset of its domain that supports probability $h$. Denote the $h$-HDR of $s(d|\psi)$ as $H_\psi$, for any $\psi \in \Theta$. Then, the $h$ confidence region for $\theta$, given data $D$, is the set $C_D = \{ \phi : D \in H_\phi \}$. A typical value of $h$ would be 0.95.
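This construction can be sketched in code. The following is my own illustration, assuming a normal sampling distribution $s(d|\psi)$ for $d = \bar{x}$ with known $\sigma$ (so its $h$-HDR is the central interval, since for a symmetric unimodal PDF the smallest set of mass $h$ is central), and recovering $C_D = \{\phi : D \in H_\phi\}$ by scanning a grid of candidate $\phi$ values:

```python
import math
from statistics import NormalDist

def hdr(psi, sd, h):
    """h-HDR of the Normal(psi, sd) sampling density s(d | psi): for a
    symmetric unimodal PDF the smallest set of mass h is the central interval."""
    z = NormalDist().inv_cdf((1 + h) / 2)
    return (psi - z * sd, psi + z * sd)

def confidence_region(D, sd, h, grid):
    """C_D = {phi : D in H_phi}, found by scanning a grid of candidate phi."""
    region = []
    for phi in grid:
        lo, hi = hdr(phi, sd, h)
        if lo <= D <= hi:
            region.append(phi)
    return region

# Normal mean with known sigma: d = x-bar has sd sigma / sqrt(n) given psi.
sigma, n, h, D = 2.0, 25, 0.95, 3.1
sd = sigma / math.sqrt(n)
grid = [1.0 + i * 0.001 for i in range(5001)]   # candidate phi in [1, 6]
region = confidence_region(D, sd, h, grid)
print(min(region), max(region))   # about D -/+ 1.96 * sd, i.e. (2.32, 3.88)
```

The grid inversion is generic: replacing `hdr` with the HDR of any other sampling distribution yields the corresponding confidence region, which need not be an interval.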
A Frequentist Interpretation
From the preceding definition of a confidence region it follows that $$ d \in H_\psi \longleftrightarrow \psi \in C_d $$ with $C_d = \{ \phi : d \in H_\phi \}$. Now imagine a large set of (imaginary) observations $\{D_i\}$, taken under similar circumstances to $D$; i.e., they are samples from $s(d|\theta)$. Since $H_\theta$ supports probability mass $h$ of the PDF $s(d|\theta)$, $P(D_i \in H_\theta) = h$ for all $i$. Therefore, the long-run fraction of $\{D_i\}$ for which $D_i \in H_\theta$ is $h$. And so, using the equivalence above, the long-run fraction of $\{D_i\}$ for which $\theta \in C_{D_i}$ is also $h$.
This, then, is what the frequentist claim for the $h$ confidence region for $\theta$ amounts to: were we to repeat the experiment many times, drawing datasets $\{D_i\}$ from $s(d|\theta)$ and constructing the corresponding regions $C_{D_i}$, a fraction $h$ of those regions would contain $\theta$.
The confidence region $C_D$ therefore does not make any claim about the probability that $\theta$ lies somewhere! The reason is simply that there is nothing in the formulation that allows us to speak of a probability distribution over $\theta$. The interpretation is just elaborate superstructure, which does not improve the base. The base is only $s(d | \theta)$ and $D$, where $\theta$ does not appear as a distributed quantity, and there is no information we can use to address that. There are basically two ways to get a distribution over $\theta$: assign one directly from prior information, $p(\theta | I)$; or relate $\theta$ to another distributed quantity through Bayes' theorem, $p(\theta | DI) \propto p(D | \theta I) \, p(\theta | I)$.
In both cases, $\theta$ must appear on the left of the conditioning bar somewhere. Frequentists cannot use either method, because they both require a heretical prior.
A Bayesian View
The most a Bayesian can make of the $h$ confidence region $C_D$, given without qualification, is simply the direct interpretation: that it is the set of $\phi$ for which $D$ falls in the $h$-HDR $H_\phi$ of the sampling distribution $s(d|\phi)$. It does not necessarily tell us much about $\theta$, and here's why.
The probability that $\theta \in C_D$, given $D$ and the background information $I$, is: \begin{align*} P(\theta \in C_D | DI) &= \int_{C_D} p(\theta | DI) d\theta \\ &= \int_{C_D} \frac{p(D | \theta I) p(\theta | I)}{p(D | I)} d\theta \end{align*} Notice that, unlike the frequentist interpretation, we have immediately demanded a distribution over $\theta$. The background information $I$ tells us, as before, that the sampling distribution is $s(d | \theta)$: \begin{align*} P(\theta \in C_D | DI) &= \int_{C_D} \frac{s(D | \theta) p(\theta | I)}{p(D | I)} d \theta \\ &= \frac{\int_{C_D} s(D | \theta) p(\theta | I) d\theta}{p(D | I)} \\ \text{i.e.} \quad\quad P(\theta \in C_D | DI) &= \frac{\int_{C_D} s(D | \theta) p(\theta | I) d\theta}{\int s(D | \theta) p(\theta | I) d\theta} \end{align*} Now this expression does not in general evaluate to $h$, which is to say, the $h$ confidence region $C_D$ does not always contain $\theta$ with probability $h$. In fact it can be starkly different from $h$. There are, however, many common situations in which it does evaluate to $h$, which is why confidence regions are often consistent with our probabilistic intuitions.
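To see the discrepancy numerically, here is a sketch of my own (assuming the normal sampling distribution used later in this answer, with $\sigma = 2$, $n = 25$, $D = 3.1$, and an assumed informative prior $\theta \sim \text{Normal}(0, 1)$), evaluating the ratio of integrals above by a simple Riemann sum:

```python
import math
from statistics import NormalDist

sigma, n, h, D = 2.0, 25, 0.95, 3.1
sd = sigma / math.sqrt(n)                    # sd of the sampling dist of d
z = NormalDist().inv_cdf((1 + h) / 2)
C = (D - z * sd, D + z * sd)                 # the h confidence region C_D

def s(d, theta):
    """Sampling density s(d | theta)."""
    return NormalDist(theta, sd).pdf(d)

prior = NormalDist(0.0, 1.0)                 # an informative prior p(theta | I)

# P(theta in C_D | D I) = int_{C_D} s(D|t) p(t) dt / int s(D|t) p(t) dt,
# evaluated by a simple Riemann sum over theta.
dt = 0.001
thetas = [-8.0 + i * dt for i in range(16001)]
numer = sum(s(D, t) * prior.pdf(t) for t in thetas if C[0] <= t <= C[1]) * dt
denom = sum(s(D, t) * prior.pdf(t) for t in thetas) * dt
posterior_prob = numer / denom
print(posterior_prob)   # about 0.83 here, well short of h = 0.95
```

The informative prior pulls the posterior toward 0, away from $D$, so the region $C_D$ captures noticeably less than $h$ of the posterior mass.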
For example, suppose that the prior joint PDF of $d$ and $\theta$ is symmetric in that $p_{d,\theta}(d,\theta | I) = p_{d,\theta}(\theta,d | I)$. (Clearly this involves an assumption that the PDF ranges over the same domain in $d$ and $\theta$.) Then, if the prior is $p(\theta | I) = f(\theta)$, we have $s(D | \theta) p(\theta | I) = s(D | \theta) f(\theta) = s(\theta | D) f(D)$. Hence \begin{align*} P(\theta \in C_D | DI) &= \frac{\int_{C_D} s(\theta | D) d\theta}{\int s(\theta | D) d\theta} \\ \text{i.e.} \quad\quad P(\theta \in C_D | DI) &= \int_{C_D} s(\theta | D) d\theta \end{align*} From the definition of an HDR we know that for any $\psi \in \Theta$ \begin{align*} \int_{H_\psi} s(d | \psi) dd &= h \\ \text{and therefore that} \quad\quad \int_{H_D} s(d | D) dd &= h \\ \text{or equivalently} \quad\quad \int_{H_D} s(\theta | D) d\theta &= h \end{align*} Therefore, given that $s(d | \theta) f(\theta) = s(\theta | d) f(d)$, $C_D = H_D$ implies $P(\theta \in C_D | DI) = h$. The antecedent satisfies $$ C_D = H_D \longleftrightarrow \forall \psi \; [ \psi \in C_D \leftrightarrow \psi \in H_D ] $$ Applying the equivalence near the top: $$ C_D = H_D \longleftrightarrow \forall \psi \; [ D \in H_\psi \leftrightarrow \psi \in H_D ] $$ Thus, the confidence region $C_D$ contains $\theta$ with probability $h$ if for all possible values $\psi$ of $\theta$, the $h$-HDR of $s(d | \psi)$ contains $D$ if and only if the $h$-HDR of $s(d | D)$ contains $\psi$.
Now the symmetric relation $D \in H_\psi \leftrightarrow \psi \in H_D$ is satisfied for all $\psi$ when $s(\psi + \delta | \psi) = s(D - \delta | D)$ for all $\delta$ that span the support of $s(d | D)$ and $s(d | \psi)$. We can therefore form the following argument:

1. $s(d | \theta) f(\theta) = s(\theta | d) f(d)$ (premise).
2. $\forall \psi \; \forall \delta \; [ s(\psi + \delta | \psi) = s(D - \delta | D) ]$ (premise).
3. From 2, the density profile of $s(d | \psi)$ about $\psi$ mirrors that of $s(d | D)$ about $D$, so $\forall \psi \; [ D \in H_\psi \leftrightarrow \psi \in H_D ]$.
4. From 3 and the equivalence established above, $C_D = H_D$.
5. From 1, $P(\theta \in C_D | DI) = \int_{C_D} s(\theta | D) \, d\theta$.
6. From the definition of an HDR, $\int_{H_D} s(\theta | D) \, d\theta = h$.
7. From 4 and 5, $P(\theta \in C_D | DI) = \int_{H_D} s(\theta | D) \, d\theta$.
8. From 6 and 7, $P(\theta \in C_D | DI) = h$.
Let's apply the argument to a confidence interval on the mean of a 1-D normal distribution $(\mu, \sigma)$, given a sample mean $\bar{x}$ from $n$ measurements. We have $\theta = \mu$ and $d = \bar{x}$, so that the sampling distribution is $$ s(d | \theta) = \frac{\sqrt{n}}{\sigma \sqrt{2 \pi}} e^{-\frac{n}{2 \sigma^2} { \left( d - \theta \right) }^2 } $$ Suppose also that we know nothing about $\theta$ before taking the data (except that it's a location parameter) and therefore assign a uniform prior: $f(\theta) = k$. Clearly we now have $s(d | \theta) f(\theta) = s(\theta | d) f(d)$, so the first premise is satisfied. Let $s(d | \theta) = g\left( (d - \theta)^2 \right)$. (i.e. It can be written in that form.) Then \begin{gather*} s(\psi + \delta | \psi) = g \left( (\psi + \delta - \psi)^2 \right) = g(\delta^2) \\ \text{and} \quad\quad s(D - \delta | D) = g \left( (D - \delta - D)^2 \right) = g(\delta^2) \\ \text{so that} \quad\quad \forall \psi \; \forall \delta \; [s(\psi + \delta | \psi) = s(D - \delta | D)] \end{gather*} whereupon the second premise is satisfied. Both premises being true, the eight-point argument leads us to conclude that the probability that $\theta$ lies in the confidence interval $C_D$ is $h$!
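This conclusion can be checked numerically. The sketch below (my own illustration, with assumed values $\sigma = 2$, $n = 25$, $D = 3.1$) evaluates $P(\theta \in C_D | DI)$ under the uniform prior by a Riemann sum; the constant $k$ cancels in the ratio:

```python
import math
from statistics import NormalDist

sigma, n, h, D = 2.0, 25, 0.95, 3.1
sd = sigma / math.sqrt(n)
z = NormalDist().inv_cdf((1 + h) / 2)
C = (D - z * sd, D + z * sd)                 # the h confidence interval C_D

def s(d, theta):
    """Sampling density s(d | theta), normal with known sigma."""
    return NormalDist(theta, sd).pdf(d)

# Uniform prior f(theta) = k over a wide range: k cancels in the ratio, so
# P(theta in C_D | D I) reduces to int_{C_D} s(D|t) dt / int s(D|t) dt.
dt = 0.001
thetas = [-8.0 + i * dt for i in range(16001)]
numer = sum(s(D, t) for t in thetas if C[0] <= t <= C[1]) * dt
denom = sum(s(D, t) for t in thetas) * dt
posterior_prob = numer / denom
print(posterior_prob)   # very close to h = 0.95, as the argument concludes
```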
We therefore have an amusing irony: the confidence interval acquires the probability-$h$ interpretation that everyone instinctively gives it only by way of premises, notably a prior over $\theta$, that the frequentist is forbidden to invoke.
Final Remarks
We have identified conditions (i.e. the two premises) under which the $h$ confidence region does indeed yield probability $h$ that $\theta \in C_D$. A frequentist will baulk at the first premise, because it involves a prior on $\theta$, and this sort of deal-breaker is inescapable on the route to a probability. But for a Bayesian, it is acceptable---nay, essential. These conditions are sufficient but not necessary, so there are many other circumstances under which the Bayesian $P(\theta \in C_D | DI)$ equals $h$. Equally though, there are many circumstances in which $P(\theta \in C_D | DI) \ne h$, especially when the prior information is significant.
We have applied a Bayesian analysis just as a consistent Bayesian would, given the information at hand, including statistics $D$. But a Bayesian, if he possibly can, will apply his methods to the raw measurements instead---to the $\{x_i\}$, rather than $\bar{x}$. Oftentimes, collapsing the raw data into summary statistics $D$ destroys information in the data; and then the summary statistics are incapable of speaking as eloquently as the original data about the parameters $\theta$.
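As a small illustration of that information loss (my own example, using a uniform model rather than anything from the discussion above): for a $\text{Uniform}(0, \theta)$ sample, the sample maximum is sufficient for $\theta$ while the sample mean is not, so an estimator built from $\bar{x}$ alone is markedly noisier than one built from the sufficient statistic:

```python
import random

random.seed(2)

# Uniform(0, theta) sample: the sample maximum is sufficient for theta,
# the sample mean is not; compare mean squared errors by simulation.
theta, n, trials = 10.0, 20, 5000
sq_err_mean, sq_err_max = 0.0, 0.0
for _ in range(trials):
    x = [random.uniform(0, theta) for _ in range(n)]
    est_from_mean = 2 * sum(x) / n            # uses only the summary x-bar
    est_from_max = (n + 1) / n * max(x)       # uses the sufficient statistic
    sq_err_mean += (est_from_mean - theta) ** 2
    sq_err_max += (est_from_max - theta) ** 2
print(sq_err_mean / trials, sq_err_max / trials)   # x-bar estimator is far noisier
```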