Interpretation of Confidence Interval in terms of Probability

confidence-interval, probability, random-variables, statistical-inference

Let $\hat{P}$ be the random variable of the sample proportion and $p$ be the population parameter.

Let's form a $95\%$ interval estimate by approximating the distribution of $\hat{P}$ with a normal distribution.

We know $\text{Pr}\left(-1.96<Z<1.96\right)=0.95$, where $Z$ is the standard normal random variable ($Z\sim N\left(0,1\right)$).

We also know that $\text{sd}\left(\hat{P}\right)=\sqrt{\frac{p\left(1-p\right)}{n}}$ and $E\left(\hat{P}\right)=p$.

Standardizing $\hat{P}$ then gives $\text{Pr}\left(-1.96<\frac{\hat{P}-p}{\sqrt{\frac{p\left(1-p\right)}{n}}}<1.96\right)\approx 0.95$, which can be rearranged to

$\text{Pr}\left(\hat{P}-1.96\sqrt{\frac{p\left(1-p\right)}{n}}<p<\hat{P}+1.96\sqrt{\frac{p\left(1-p\right)}{n}}\right)\approx 0.95$
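
For reference, the rearrangement can be spelled out step by step. This is only a sketch of the algebra; the shorthand $s=\sqrt{\frac{p\left(1-p\right)}{n}}$ is introduced here for readability and is not part of the notation above. Since $s>0$, multiplying by it keeps the direction of the inequalities, while multiplying by $-1$ reverses them:

$$
\begin{aligned}
-1.96 &< \frac{\hat{P}-p}{s} < 1.96 \\
-1.96\,s &< \hat{P}-p < 1.96\,s \\
-\hat{P}-1.96\,s &< -p < -\hat{P}+1.96\,s \\
\hat{P}-1.96\,s &< p < \hat{P}+1.96\,s
\end{aligned}
$$

Each line describes the same event, so the probability of roughly $0.95$ carries over unchanged. The random quantity throughout is $\hat{P}$, not $p$.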

How do we interpret the above statement? Does a 95% confidence interval mean that there is a 95% probability that the population parameter lies between those two numbers? Does it mean there is a 95% probability that the intervals formed from the values of $\hat{P}$ will contain $p$?

Any help or answers would be greatly appreciated 🙂

Best Answer

Confidence intervals are great for illustrating the difference between epistemic and aleatory uncertainty.

Before you collect your sample, the probability statement is an aleatory statement -- that is, it pertains to the actual repeated-sampling (frequentist) probability that the interval, which is still undetermined and therefore random, will contain the true value of the parameter.

After you collect the sample and form your interval, you no longer have any aleatory uncertainty (you now have your sample and your interval -- all random values are known). The resulting interval either contains the true parameter or it does not; you just don't know which. So in what sense should we care about this actual interval?

This is where epistemic uncertainty comes in. We know the aleatory/objective probability of the interval containing the true parameter is either 0 or 1. But we don't know which one! Therefore, the uncertainty is no longer in the values themselves, but in our knowledge of them. Given this, the post-sampling "confidence" is an epistemic statement (whereas pre-sampling it was an actual probability statement).

So, for a 95% CI, we know that 95% of intervals formed this way will contain the true parameter; therefore, we should lean toward believing this interval contains the true parameter, accepting the fact that 5% of such intervals will actually not contain it (i.e., be misleading).
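
The repeated-sampling claim can be checked with a small simulation. The following is a minimal sketch, not a definitive implementation: the function names `wald_interval` and `coverage`, and the values $p=0.3$ and $n=200$, are hypothetical choices made for illustration. It also assumes the usual plug-in form of the interval, where the unknown $p$ in the standard deviation is replaced by the observed sample proportion, since $p$ is not available in practice.

```python
import math
import random


def wald_interval(successes, n, z=1.96):
    """Normal-approximation interval for a proportion.

    Assumption: the unknown p in the standard deviation is replaced by
    the observed sample proportion (the plug-in form of the interval).
    """
    p_hat = successes / n
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half_width, p_hat + half_width


def coverage(p, n, trials, seed=0):
    """Fraction of simulated intervals that contain the true p."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        # Draw one sample of size n from a population with proportion p.
        successes = sum(rng.random() < p for _ in range(n))
        lo, hi = wald_interval(successes, n)
        # Each realized interval either contains p or it does not.
        if lo < p < hi:
            hits += 1
    return hits / trials


if __name__ == "__main__":
    # Hypothetical population proportion and sample size.
    print(coverage(p=0.3, n=200, trials=5000))
```

Inside the loop, every individual interval is simply a hit or a miss; the figure near 95% only appears as a long-run fraction over many repetitions of the procedure. Because the normal approximation and the plug-in standard deviation are both approximate, the simulated fraction should be close to, but not exactly, 0.95.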

Bottom line: pre-sampling, confidence is a true/aleatory probability. Post-sampling, it cannot be interpreted as a frequentist probability, but it is valid to use confidence as a measure of how strongly you should believe the interval is accurate.
