Solved – Confidence interval vs. prediction interval misunderstanding

bootstrap, confidence interval, normality-assumption, prediction interval, rms

Problem

I have a time series data set with about 50 observations. I'd like to compute an interval that is likely to contain the next/future value in the time series (the 51st data point). I tried using a 90% t-confidence interval for this (the data isn't very normal), so I calculated the mean, standard deviation, etc. However, the resulting interval captured less than 50% of the sample. That's not a very reassuring result given that it is a 90% CI, and it doesn't give me much confidence in the ability of the interval to contain the next value observed in the time series.

After reading more about CI…

I started realizing that expecting a 90% confidence interval to contain about 90% of the sample is a popular misunderstanding, because the confidence interval is a statement about a population parameter. Also, the parameter that my interval describes is the mean. However, this got me wondering if using a confidence interval to solve my problem even makes sense. I computed a 90% confidence interval around the mean of my data set, but what I need is an interval that captures the next value in my data set. I believe those are two different things.

Questions

  1. Is there another method that's more appropriate? I saw something about using the RMSE instead of standard deviation in a confidence interval and adding the 90% t-value and RMSE based margin of error to the mean. I also saw the "prediction interval" method. Would bootstrapping be helpful? What is best? What sort of assumptions would be made about the data?

  2. Why doesn't a 90% CI capture at least 90% of the sample, mathematically speaking?

Best Answer

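I don’t have a complete answer to your question, but regarding the popular misconception: a 90% CI for the mean is a statement about how the interval behaves over repeated sampling, not about where individual observations fall. Expanded: an X% confidence interval (where X could be 90, 95, 99, etc.) means that if you repeatedly drew random samples from the distribution and computed the interval from each one, about X% of those intervals would contain the true population mean. It does not mean that X% of the population (or of the sample) lies in the CI. The half-width of a t-interval for the mean is proportional to s/√n, so it shrinks as the sample grows; with about 50 observations it is far narrower than the spread of the data, which is why it captures well under 90% of the sample.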
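To make the difference concrete, here is a minimal sketch in Python (NumPy and SciPy) comparing a 90% t-interval for the mean with the textbook 90% prediction interval for one new observation. It assumes the observations are independent and roughly normal, with no trend or autocorrelation, which may not hold for a real time series. The data are simulated, and the function names are only illustrative, not taken from the question.

```python
import numpy as np
from scipy import stats


def mean_confidence_interval(x, level=0.90):
    """t-based confidence interval for the population mean."""
    x = np.asarray(x, dtype=float)
    n = x.size
    s = x.std(ddof=1)  # sample standard deviation
    t_crit = stats.t.ppf(0.5 + level / 2, df=n - 1)
    half_width = t_crit * s / np.sqrt(n)
    return x.mean() - half_width, x.mean() + half_width


def prediction_interval(x, level=0.90):
    """t-based prediction interval for a single new observation.

    Assumes i.i.d. normal data (an assumption, not something the
    original question establishes).
    """
    x = np.asarray(x, dtype=float)
    n = x.size
    s = x.std(ddof=1)
    t_crit = stats.t.ppf(0.5 + level / 2, df=n - 1)
    # The extra "1 +" accounts for the variability of the new value itself.
    half_width = t_crit * s * np.sqrt(1 + 1 / n)
    return x.mean() - half_width, x.mean() + half_width


def fraction_inside(x, interval):
    """Share of the observations that fall inside the interval."""
    lo, hi = interval
    x = np.asarray(x, dtype=float)
    return float(np.mean((x >= lo) & (x <= hi)))


# Simulated stand-in for the 50 observations (made-up data).
rng = np.random.default_rng(0)
sample = rng.normal(loc=10.0, scale=2.0, size=50)

ci = mean_confidence_interval(sample)
pi = prediction_interval(sample)

print("90% CI for the mean:", ci)
print("share of sample inside CI:", fraction_inside(sample, ci))
print("90% prediction interval:", pi)
print("share of sample inside PI:", fraction_inside(sample, pi))
```

Under these assumptions the prediction interval is wider than the confidence interval by a factor of √(n + 1), roughly 7 for n = 50, so the confidence interval should cover only a small share of the sample while the prediction interval covers most of it. If the data are clearly non-normal or autocorrelated, empirical quantiles, a bootstrap, or a fitted time series model may be more appropriate than this formula.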