This is a good question. Unfortunately, while the academic forecasting literature is indeed (slowly) moving from an almost exclusive emphasis on point forecasts towards interval forecasts and predictive densities, there has been little work on evaluating interval forecasts. (EDIT: See the bottom of this answer for an update.)
As gung notes, whether a given 95% prediction interval contains the realized value is in principle a Bernoulli trial with $p=0.95$. Given enough PIs and realizations, you can therefore test the null hypothesis that the actual coverage probability is at least 95%.
However, you will need to think about statistical power and sample sizes. It's probably best to decide beforehand what kind of deviation from the target coverage is still acceptable (is 92% OK? 90%?), then find the minimum sample size needed to detect a deviation this strong or stronger with a given power, say $1-\beta=80\%$, which is the standard choice. You can do this by a straightforward simulation: simulate $n$ Bernoulli trials with $p=0.92$, estimate $\hat{p}$ together with a confidence interval for it, check whether the CI contains the value 95%, repeat this many times, and tweak $n$ until 95% falls outside the CI in 80% of the runs. Or use any power calculator for proportions.
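Such a power simulation is easy to do by hand. A sketch in R (the function name and the assumed true coverage of 92% are my own choices):

```r
# Estimate the power to detect a true coverage of 92% when the target is 95%:
# simulate n Bernoulli trials, compute a 95% CI for the hit rate, and record
# how often that CI excludes the nominal coverage of 0.95.
power.for.n <- function(n, p.true = 0.92, p.nominal = 0.95, n.sim = 2000) {
  rejections <- replicate(n.sim, {
    hits <- rbinom(1, size = n, prob = p.true)
    ci <- prop.test(hits, n)$conf.int  # CI for the observed coverage
    p.nominal < ci[1] || p.nominal > ci[2]
  })
  mean(rejections)
}

power.for.n(200)   # power based on batches of 200 PI/realization pairs
power.for.n(1000)  # power grows with the batch size
```

Tweak `n` until the estimated power reaches your target of 80%.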
Now that we have our sample size, we can group our PIs and realizations into batches of this size, check how often the PIs contain the true realization, and start testing. A batch can be the last $n$ PIs/realizations of a single time series, or the latest PIs/realizations across a large number of time series you are forecasting, or whatever suits your setting.
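The per-batch test itself can then be an exact binomial test; a sketch with made-up counts:

```r
# Suppose 178 of the 200 PIs in a batch contained their realization
# (an observed coverage of 89%). Test H0: coverage = 0.95 against the
# one-sided alternative that the true coverage is lower.
binom.test(x = 178, n = 200, p = 0.95, alternative = "less")
```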
This approach has the advantage of being rather easy to explain and to understand. Of course, with a large number of trials, even small deviations from the target coverage will be statistically significant, which is why you need to decide beforehand what deviation is actually meaningful from a business perspective, as per above.
Alternatively, quantile forecasts (say, a 2.5% and a 97.5% quantile forecast, to yield a 95% PI) arise naturally as optimal point forecasts under certain loss functions, which are parameterized based on the target quantile. This paper gives a nice overview. This may be an alternative to the Bernoulli tests above: find the correct loss function for your target upper and lower quantile, then evaluate the two endpoints of your PIs under these loss functions. However, the loss functions are rather abstract and not easily understood, especially for nontechnical audiences.
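The loss functions in question are the asymmetric quantile ("pinball") losses. A minimal R sketch (the function name and the toy numbers are my own):

```r
# Pinball loss: the tau-quantile is the optimal point forecast under this loss.
# Undershooting is penalized at rate tau, overshooting at rate (1 - tau).
pinball <- function(actual, forecast, tau) {
  ifelse(actual >= forecast,
         tau * (actual - forecast),
         (1 - tau) * (forecast - actual))
}

# Evaluate the lower and upper endpoints of 95% PIs under their own losses:
actuals <- c(10, 12, 9, 15)
lower   <- c(7,  8,  8, 10)   # 2.5% quantile forecasts
upper   <- c(13, 14, 12, 14)  # 97.5% quantile forecasts
mean(pinball(actuals, lower, tau = 0.025))
mean(pinball(actuals, upper, tau = 0.975))
```

Lower average losses are better; the asymmetry is what makes a low (or high) quantile, rather than the median, the optimal forecast.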
If you are comparing, say, multiple forecasting methods, you could first discard those whose PIs significantly underperform, based on Bernoulli hypothesis tests or loss functions, then assess the ones that passed this initial screening based on the width of their PIs. Among two PIs with the same correct coverage rate, the narrower one is usually better.
For a simple evaluation of PIs using null hypothesis significance tests, see this paper. There are also some far more elaborate schemes for evaluating PIs, which can also deal with serial dependence in deviations in coverage (maybe your financial PIs are good some part of the year, but bad at specific times), like this paper and that paper. Unfortunately, these require quite a large number of PI/realizations and so are likely only relevant for high-frequency financial data, like stock prices reported multiple times per day.
Finally, there has recently been some interest in going beyond PIs to the underlying predictive densities, which can be evaluated using (proper) scoring rules. Tilmann Gneiting has been very active in this area, and this paper of his gives a good introduction. However, even if you do decide to go deeper into predictive densities, scoring rules are again quite abstract and hard to communicate to a nontechnical audience.
EDIT - an update:
Your quality measure needs to balance coverage and length of the prediction intervals: yes, we want high coverage, but we also want short intervals.
There is a quality measure that does precisely this and has attractive properties: the interval score. Let $\ell$ and $u$ be the lower and the upper end of the prediction interval. The score is given by
$$ S(\ell,u,h) = (u-\ell)+\frac{2}{\alpha}(\ell-h)1(h<\ell)+\frac{2}{\alpha}(h-u)1(h>u). $$
Here $1$ is the indicator function, $h$ is the realized value, and $\alpha$ is one minus the nominal coverage, e.g., $\alpha=0.05$ for a 95% interval. (You will need to prespecify the coverage, based on what you plan on doing with the prediction interval. It makes no sense to aim for 100% coverage, because the resulting intervals will be too wide to be useful for anything.)
You can then average the interval score over many predictions. The lower the average score, the better. See Gneiting & Raftery (2007, JASA) for a discussion and pointers to further literature. A scaled version of this score was used, for instance, to assess prediction intervals in the recent M4 forecasting competition.
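A minimal R sketch of the interval score (the function name and the toy numbers are my own):

```r
# Interval score for a central (1 - alpha) x 100% prediction interval:
# the width of the interval, plus a penalty of 2/alpha per unit by which
# the realization falls outside it.
interval.score <- function(lower, upper, actual, alpha) {
  (upper - lower) +
    (2 / alpha) * (lower - actual) * (actual < lower) +
    (2 / alpha) * (actual - upper) * (actual > upper)
}

# 95% PIs (alpha = 0.05) for four realizations; the last one misses:
lower   <- c(7,  8,  8, 10)
upper   <- c(13, 14, 12, 14)
actuals <- c(10, 12, 9, 16)
mean(interval.score(lower, upper, actuals, alpha = 0.05))
```

Note how a single miss is penalized heavily relative to the widths, so narrow intervals only pay off if their coverage holds up.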
(Full disclosure: this was shamelessly cribbed from this answer of mine.)
The method laid out below is the one described in Section 6.3.3 of Davison and Hinkley (1997), Bootstrap Methods and Their Application. Thanks to Glen_b and his comment here. Given that there have been several questions on Cross Validated on this topic, I thought it was worth writing up.
The linear regression model is:
\begin{align}
Y_i &= X_i\beta+\epsilon_i
\end{align}
We have data $\left(Y_i, X_i\right)$, $i=1,2,\ldots,N$, which we use to estimate $\beta$ as:
\begin{align}
\hat{\beta}_{\text{OLS}} &= \left( X'X \right)^{-1}X'Y
\end{align}
Now, we want to predict what $Y$ will be for a new data point, given that we know $X$ for it. This is the prediction problem. Let's call the new $X$ (which we know) $X_{N+1}$ and the new $Y$ (which we would like to predict), $Y_{N+1}$. The usual prediction (if we are assuming that the $\epsilon_i$ are iid and uncorrelated with $X$) is:
\begin{align}
Y^p_{N+1} &= X_{N+1}\hat{\beta}_{\text{OLS}}
\end{align}
The forecast error made by this prediction is:
\begin{align}
e^p_{N+1} &= Y_{N+1}-Y^p_{N+1}
\end{align}
We can rewrite this equation as:
\begin{align}
Y_{N+1} &= Y^p_{N+1} + e^p_{N+1}
\end{align}
Now, $Y^p_{N+1}$ we have already calculated. So, if we want to bound $Y_{N+1}$ in an interval, say, 90% of the time, all we need to do is estimate consistently the $5^{th}$ and $95^{th}$ percentiles/quantiles of $e^p_{N+1}$, call them $e^5,e^{95}$, and the prediction interval will be $\left[Y^p_{N+1}+e^5,Y^p_{N+1}+e^{95} \right]$.
How to estimate the quantiles/percentiles of $e^p_{N+1}$? Well, we can write:
\begin{align}
e^p_{N+1} &= Y_{N+1}-Y^p_{N+1}\\
&= X_{N+1}\beta + \epsilon_{N+1} - X_{N+1}\hat{\beta}_{\text{OLS}}\\
&= X_{N+1}\left( \beta-\hat{\beta}_{\text{OLS}} \right) + \epsilon_{N+1}
\end{align}
The strategy will be to sample (in a bootstrap kind of way) many times from $e^p_{N+1}$ and then calculate percentiles in the usual way. So, maybe we will sample 10,000 times from $e^p_{N+1}$, and then estimate the $5^{th}$ and $95^{th}$ percentiles as the $500^{th}$ and $9,500^{th}$ smallest members of the sample.
To draw on $X_{N+1}\left( \beta-\hat{\beta}_{\text{OLS}} \right)$, we can bootstrap the errors (resampling cases would be fine, too, but we are assuming iid errors anyway). So, on each bootstrap replication, you draw $N$ times with replacement from the variance-adjusted residuals (see the next paragraph) to get $\epsilon^*_i$, then form new $Y^*_i=X_i\hat{\beta}_{\text{OLS}}+\epsilon^*_i$, then run OLS on the new dataset, $\left(Y^*,X \right)$, to get this replication's $\beta^*_r$. Finally, this replication's draw on $X_{N+1}\left( \beta-\hat{\beta}_{\text{OLS}} \right)$ is $X_{N+1}\left( \hat{\beta}_{\text{OLS}}-\beta^*_r \right)$.
Given that we are assuming iid $\epsilon$, the natural way to sample from the $\epsilon_{N+1}$ part of the equation is to use the residuals from the regression, $\left\{ e_1,e_2,\ldots,e_N \right\}$. Residuals have variances that differ across observations and that are generally too small, so we will instead sample from the mean-centered, variance-corrected residuals, $\left\{ s_1-\overline{s},s_2-\overline{s},\ldots,s_N-\overline{s} \right\}$, where $s_i=e_i/\sqrt{1-h_i}$ and $h_i$ is the leverage of observation $i$.
And, finally, the algorithm for making a 90% prediction interval for $Y_{N+1}$, given $X_{N+1}$, is:
- Make the prediction $Y^p_{N+1}=X_{N+1}\hat{\beta}_{\text{OLS}}$.
- Make the mean-centered, variance-adjusted residuals, $\left\{ s_1-\overline{s},s_2-\overline{s},\ldots,s_N-\overline{s}\right\}$, where $s_i=e_i/\sqrt{1-h_{i}}$ and $h_i$ is the leverage of observation $i$.
- For replications $r=1,2,\ldots,R$:
  - Draw $N$ times with replacement on the adjusted residuals to make bootstrap residuals $\left\{\epsilon^*_1,\epsilon^*_2,\ldots,\epsilon^*_N \right\}$
  - Generate bootstrap $Y^*=X\hat{\beta}_{\text{OLS}}+\epsilon^*$
  - Calculate the bootstrap OLS estimator for this replication, $\beta^*_r=\left( X'X \right)^{-1}X'Y^*$
  - Obtain the bootstrap residuals from this replication, $e^*_r=Y^*-X\beta^*_r$
  - Calculate the bootstrap variance-adjusted residuals from this replication, $s^*_i-\overline{s^*}$ (the leverages are unchanged, since $X$ is unchanged)
  - Draw one of the bootstrap variance-adjusted residuals from this replication, $\epsilon^*_{N+1,r}$
  - Calculate this replication's draw on $e^p_{N+1}$: $e^{p*}_r=X_{N+1}\left( \hat{\beta}_{\text{OLS}}-\beta^*_r \right)+\epsilon^*_{N+1,r}$
- Find the $5^{th}$ and $95^{th}$ percentiles of $\left\{e^{p*}_1,\ldots,e^{p*}_R\right\}$; call them $e^5,e^{95}$
- The 90% prediction interval for $Y_{N+1}$ is $\left[Y^p_{N+1}+e^5,Y^p_{N+1}+e^{95} \right]$.
Here is R code:
# This script gives an example of the procedure to construct a prediction interval
# for a linear regression model using a bootstrap method. The method is the one
# described in Section 6.3.3 of Davison and Hinkley (1997),
# _Bootstrap Methods and Their Application_.
set.seed(12344321)
# Generate bivariate regression data
x <- runif(n=100,min=0,max=100)
y <- 1 + x + (rexp(n=100,rate=0.25)-4)
my.reg <- lm(y~x)
summary(my.reg)
# Predict y for x=78:
y.p <- coef(my.reg)["(Intercept)"] + coef(my.reg)["x"]*78
y.p
# Create adjusted residuals
leverage <- influence(my.reg)$hat
my.s.resid <- residuals(my.reg)/sqrt(1-leverage)
my.s.resid <- my.s.resid - mean(my.s.resid)
# One bootstrap replication: returns one draw on the prediction error e^p_{N+1}
the.replication <- function(reg,s,x_Np1=0){
# Make bootstrap residuals
ep.star <- sample(s,size=length(reg$residuals),replace=TRUE)
# Make bootstrap Y
y.star <- fitted(reg)+ep.star
# Do bootstrap regression
x <- model.frame(reg)[,2]
bs.reg <- lm(y.star~x)
# Create bootstrapped adjusted residuals
bs.lev <- influence(bs.reg)$hat
bs.s <- residuals(bs.reg)/sqrt(1-bs.lev)
bs.s <- bs.s - mean(bs.s)
# Calculate draw on prediction error
xb.xb <- coef(reg)["(Intercept)"] - coef(bs.reg)["(Intercept)"]
xb.xb <- xb.xb + (coef(reg)["x"] - coef(bs.reg)["x"])*x_Np1
return(unname(xb.xb + sample(bs.s,size=1)))
}
# Do bootstrap with 10,000 replications
ep.draws <- replicate(n=10000,the.replication(reg=my.reg,s=my.s.resid,x_Np1=78))
# Create prediction interval
y.p+quantile(ep.draws,probs=c(0.05,0.95))
# prediction interval using normal assumption
predict(my.reg,newdata=data.frame(x=78),interval="prediction",level=0.90)
# Quick and dirty Monte Carlo to see which prediction interval is better
# That is, what are the 5th and 95th percentiles of Y_{N+1}
#
# To do it properly, I guess we would want to do the whole procedure above
# 10,000 times and then see what percentage of the time each prediction
# interval covered Y_{N+1}
y.np1 <- 1 + 78 + (rexp(n=10000,rate=0.25)-4)
quantile(y.np1,probs=c(0.05,0.95))
Best Answer
You've got two options, I think.
Generate, say, 100 synthetic data sets by sampling with replacement from the original data set. Run the knn regression over each synthetic data set and sort the resulting point predictions at the $x$ of interest. The interval from the 5th-smallest to the 95th-smallest point prediction is then a bootstrap 90% confidence interval.
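A minimal sketch of this option in R, with a hand-rolled one-dimensional knn regressor and made-up data (all names and settings are my own):

```r
# knn regression: predict at x.new as the mean y of the k nearest training points.
knn.predict <- function(x.train, y.train, x.new, k = 5) {
  nn <- order(abs(x.train - x.new))[1:k]  # indices of the k nearest neighbors
  mean(y.train[nn])
}

set.seed(1)
x <- runif(200, 0, 10)
y <- sin(x) + rnorm(200, sd = 0.3)

# Resample the data 100 times, rerun knn each time, and take the 5th and
# 95th percentiles of the point predictions at x = 4:
boot.preds <- replicate(100, {
  idx <- sample(seq_along(x), replace = TRUE)
  knn.predict(x[idx], y[idx], x.new = 4, k = 5)
})
quantile(boot.preds, probs = c(0.05, 0.95))
```

Note that this interval reflects the variability of the fitted regression function, not the spread of new observations around it.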
Basically, you use either a pooled variance estimator (if you have multiple observations at the same $x$) or pseudo-residuals to get an estimate of the error variance. Assuming homoskedastic, normal errors, you can then use the $t$-distribution:
$ \bar y_i \pm t(h,\alpha) \frac{\hat\sigma}{\sqrt{n_i}} $
where $\bar y_i$ is the average prediction in the neighborhood, $h$ is the degrees of freedom of the $t$-distribution, and $n_i$ is the number of points in the neighborhood.
You can read more about it here