[Math] confidence intervals – a bad confidence interval

statistics

I have a question about confidence intervals:

The situation in the question: we have a number of scatter plots, each showing an estimated regression line (based on a valid model) and the associated individual 95% confidence intervals (CI) for the regression function at each x-value, as well as the observed data. A professor asks, 'I don't understand how 95% of the observations fall outside the 95% CI as depicted in the figures'.
Briefly explain how it is entirely possible that 95% of the observations fall outside the 95% CI as depicted in the figures. (We weren't given the actual figures.)

Anyway, I thought it might be because a lot of outliers affected the fitted regression line, so a confidence interval formed from a bad regression line would itself be bad, resulting in 95% of observations falling outside the 95% CI.

I guess a good one would look like this: https://stats.stackexchange.com/questions/47563/plotting-the-fitted-values-and-their-confidence-intervals, where most observations are inside the confidence interval.

I also considered that some of the Gauss-Markov assumptions were violated, such as the zero conditional mean assumption. The coefficients and their standard errors would then be invalid, resulting in an invalid confidence interval. Still, 95% of observations falling outside the CI seems ridiculous.

Does anyone know the real reason(s) why this could be the case?

Best Answer

In a classical frequentist setting, the probability statements regarding a confidence interval relate to the (random) bounds of the interval. For example, take the common confidence interval for the mean, $\mu$, of some normal data generating process. We have $$P\left(\overline{y} - 1.96 \frac{\sigma}{\sqrt{n}} < \mu < \overline{y} + 1.96 \frac{\sigma}{\sqrt{n}}\right) = 0.95$$

Notice that $\mu$ is not treated as random; it is 'fixed', as there is only one true mean. The probability statement concerns the lower and upper bounds of the interval, that is, $\overline{y} \pm 1.96 \frac{\sigma}{\sqrt{n}}$. Since these bounds depend on $\bar{y}$ (let's assume for the moment that we know $\sigma$), it is entirely possible that, for a specific sample, we obtain a value of $\bar{y}$ for which most of the observations lie outside the interval. The interval is a statement about $\mu$; it makes no claim about where individual observations fall. What the confidence interval does say is that, under repeated sampling, we should expect the interval to contain the true mean 95% of the time.
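The repeated-sampling statement can be checked with a small simulation. The following is only a sketch under assumptions that are not part of the question: the data are i.i.d. normal with known $\sigma$, and the function name `simulate` and the values `n=1000`, `reps=500` and `seed=0` are chosen purely for illustration. It counts how often the interval $\overline{y} \pm 1.96 \frac{\sigma}{\sqrt{n}}$ contains $\mu$, and what fraction of the observations fall inside the interval computed from their own sample.

```python
import math
import random


def simulate(n=1000, mu=0.0, sigma=1.0, reps=500, seed=0):
    """Repeatedly draw normal samples and build the known-sigma 95% CI for mu.

    Returns (share of intervals containing mu,
             share of individual observations lying inside their own interval).
    All parameter values are illustrative assumptions.
    """
    rng = random.Random(seed)
    half_width = 1.96 * sigma / math.sqrt(n)  # shrinks as n grows
    covered = 0   # intervals that contain the true mean
    inside = 0    # observations that fall inside their sample's interval
    for _ in range(reps):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        ybar = sum(sample) / n
        lower, upper = ybar - half_width, ybar + half_width
        if lower < mu < upper:
            covered += 1
        inside += sum(lower < y < upper for y in sample)
    return covered / reps, inside / (reps * n)


coverage, frac_inside = simulate()
print(f"intervals containing mu:            {coverage:.3f}")
print(f"observations inside their interval: {frac_inside:.3f}")
```

Under these assumptions the first number should come out close to 0.95, while the second should be small (roughly 0.05 for `n=1000`), because the half-width $1.96 \frac{\sigma}{\sqrt{n}}$ is much narrower than the spread $\sigma$ of the individual observations. Increasing `n` makes the interval narrower still, so an even larger share of the observations falls outside it.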
