This is a good question. Unfortunately, while the academic forecasting literature is indeed (slowly) moving from an almost exclusive emphasis on point forecasts towards interval forecasts and predictive densities, there has been little work on evaluating interval forecasts. (EDIT: See the bottom of this answer for an update.)
As gung notes, whether a given 95% prediction interval contains the realized value is in principle a Bernoulli trial with $p=0.95$. Given enough PIs and realizations, you can therefore test the null hypothesis that the actual coverage probability is at least 95%.
However, you will need to think about statistical power and sample sizes. It's probably best to decide beforehand what kind of deviation from the target coverage is still acceptable (is 92% OK? 90%?), then find the minimum sample size needed to detect a deviation this strong or stronger with a given power, say $1-\beta=80\%$, which is the standard choice. You can do this by a straightforward simulation: simulate $n$ Bernoulli trials with $p=0.92$, estimate $\hat{p}$ together with a confidence interval for it, check whether the CI contains the value 95%, repeat this many times, and tweak $n$ until 95% falls outside the CI in 80% of the runs. Or use any power calculator for proportions.
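Such a power simulation is easy to do by hand. A sketch in R (the function name and the assumed true coverage of 92% are my own choices):

```r
# Estimate the power to detect a true coverage of 92% when the target is 95%:
# simulate n Bernoulli trials, compute a 95% CI for the hit rate, and record
# how often that CI excludes the nominal coverage of 0.95.
power.for.n <- function(n, p.true = 0.92, p.nominal = 0.95, n.sim = 2000) {
  rejections <- replicate(n.sim, {
    hits <- rbinom(1, size = n, prob = p.true)
    ci <- prop.test(hits, n)$conf.int  # CI for the observed coverage
    p.nominal < ci[1] || p.nominal > ci[2]
  })
  mean(rejections)
}

power.for.n(200)   # power based on batches of 200 PI/realization pairs
power.for.n(1000)  # power grows with the batch size
```

Tweak `n` until the estimated power reaches your target of 80%.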
Now that we have our sample size, we can group our PIs and realizations into batches of this size, check how often the PIs contain the true realization, and start testing. A batch can be the last $n$ PIs/realizations of a single time series, or the latest PIs/realizations across a large number of time series you are forecasting, or whatever suits your setting.
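The per-batch test itself can then be an exact binomial test; a sketch with made-up counts:

```r
# Suppose 178 of the 200 PIs in a batch contained their realization
# (an observed coverage of 89%). Test H0: coverage = 0.95 against the
# one-sided alternative that the true coverage is lower.
binom.test(x = 178, n = 200, p = 0.95, alternative = "less")
```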
This approach has the advantage of being rather easy to explain and to understand. Of course, with a large number of trials, even small deviations from the target coverage will be statistically significant, which is why you need to decide beforehand what deviation is actually meaningful from a business perspective, as per above.
Alternatively, quantile forecasts (say, a 2.5% and a 97.5% quantile forecast, to yield a 95% PI) arise naturally as optimal point forecasts under certain loss functions, which are parameterized based on the target quantile. This paper gives a nice overview. This may be an alternative to the Bernoulli tests above: find the correct loss function for your target upper and lower quantile, then evaluate the two endpoints of your PIs under these loss functions. However, the loss functions are rather abstract and not easily understood, especially for nontechnical audiences.
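The loss functions in question are the asymmetric quantile ("pinball") losses. A minimal R sketch (the function name and the toy numbers are my own):

```r
# Pinball loss: the tau-quantile is the optimal point forecast under this loss.
# Undershooting is penalized at rate tau, overshooting at rate (1 - tau).
pinball <- function(actual, forecast, tau) {
  ifelse(actual >= forecast,
         tau * (actual - forecast),
         (1 - tau) * (forecast - actual))
}

# Evaluate the lower and upper endpoints of 95% PIs under their own losses:
actuals <- c(10, 12, 9, 15)
lower   <- c(7,  8,  8, 10)   # 2.5% quantile forecasts
upper   <- c(13, 14, 12, 14)  # 97.5% quantile forecasts
mean(pinball(actuals, lower, tau = 0.025))
mean(pinball(actuals, upper, tau = 0.975))
```

Lower average losses are better; the asymmetry is what makes a low (or high) quantile, rather than the median, the optimal forecast.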
If you are comparing, say, multiple forecasting methods, you could first discard those whose PIs significantly underperform, based on Bernoulli hypothesis tests or loss functions, then assess the ones that passed this initial screening based on the width of their PIs. Among two PIs with the same correct coverage rate, the narrower one is usually better.
For a simple evaluation of PIs using null hypothesis significance tests, see this paper. There are also some far more elaborate schemes for evaluating PIs, which can also deal with serial dependence in deviations in coverage (maybe your financial PIs are good some part of the year, but bad at specific times), like this paper and that paper. Unfortunately, these require quite a large number of PI/realizations and so are likely only relevant for high-frequency financial data, like stock prices reported multiple times per day.
Finally, there has recently been some interest in going beyond PIs to the underlying predictive densities, which can be evaluated using (proper) scoring rules. Tilmann Gneiting has been very active in this area, and this paper of his gives a good introduction. However, even if you do decide to go deeper into predictive densities, scoring rules are again quite abstract and hard to communicate to a nontechnical audience.
EDIT - an update:
Your quality measure needs to balance coverage and length of the prediction intervals: yes, we want high coverage, but we also want short intervals.
There is a quality measure that does precisely this and has attractive properties: the interval score. Let $\ell$ and $u$ be the lower and the upper end of the prediction interval. The score is given by
$$ S(\ell,u,h) = (u-\ell)+\frac{2}{\alpha}(\ell-h)1(h<\ell)+\frac{2}{\alpha}(h-u)1(h>u). $$
Here $1$ is the indicator function, $h$ is the realized value, and $\alpha$ is one minus the nominal coverage, e.g., $\alpha=0.05$ for a 95% interval. (You will need to prespecify the coverage, based on what you plan on doing with the prediction interval. It makes no sense to aim for 100% coverage, because the resulting intervals will be too wide to be useful for anything.)
You can then average the interval score over many predictions. The lower the average score, the better. See Gneiting & Raftery (2007, JASA) for a discussion and pointers to further literature. A scaled version of this score was used, for instance, to assess prediction intervals in the recent M4 forecasting competition.
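A minimal R sketch of the interval score (the function name and the toy numbers are my own):

```r
# Interval score for a central (1 - alpha) x 100% prediction interval:
# the width of the interval, plus a penalty of 2/alpha per unit by which
# the realization falls outside it.
interval.score <- function(lower, upper, actual, alpha) {
  (upper - lower) +
    (2 / alpha) * (lower - actual) * (actual < lower) +
    (2 / alpha) * (actual - upper) * (actual > upper)
}

# 95% PIs (alpha = 0.05) for four realizations; the last one misses:
lower   <- c(7,  8,  8, 10)
upper   <- c(13, 14, 12, 14)
actuals <- c(10, 12, 9, 16)
mean(interval.score(lower, upper, actuals, alpha = 0.05))
```

Note how a single miss is penalized heavily relative to the widths, so narrow intervals only pay off if their coverage holds up.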
(Full disclosure: this was shamelessly cribbed from this answer of mine.)
The method laid out below is the one described in Section 6.3.3 of Davison and Hinkley (1997), Bootstrap Methods and Their Application. Thanks to Glen_b and his comment here. Given that there have been several questions on Cross Validated on this topic, I thought it was worth writing up.
The linear regression model is:
\begin{align}
Y_i &= X_i\beta+\epsilon_i
\end{align}
We have data $\left(Y_i, X_i\right)$, $i=1,2,\ldots,N$, which we use to estimate $\beta$ as:
\begin{align}
\hat{\beta}_{\text{OLS}} &= \left( X'X \right)^{-1}X'Y
\end{align}
Now, we want to predict what $Y$ will be for a new data point, given that we know $X$ for it. This is the prediction problem. Let's call the new $X$ (which we know) $X_{N+1}$ and the new $Y$ (which we would like to predict), $Y_{N+1}$. The usual prediction (if we are assuming that the $\epsilon_i$ are iid and uncorrelated with $X$) is:
\begin{align}
Y^p_{N+1} &= X_{N+1}\hat{\beta}_{\text{OLS}}
\end{align}
The forecast error made by this prediction is:
\begin{align}
e^p_{N+1} &= Y_{N+1}-Y^p_{N+1}
\end{align}
We can rewrite this equation as:
\begin{align}
Y_{N+1} &= Y^p_{N+1} + e^p_{N+1}
\end{align}
Now, $Y^p_{N+1}$ we have already calculated. So, if we want to bound $Y_{N+1}$ in an interval, say, 90% of the time, all we need to do is estimate consistently the $5^{th}$ and $95^{th}$ percentiles/quantiles of $e^p_{N+1}$, call them $e^5,e^{95}$, and the prediction interval will be $\left[Y^p_{N+1}+e^5,Y^p_{N+1}+e^{95} \right]$.
How to estimate the quantiles/percentiles of $e^p_{N+1}$? Well, we can write:
\begin{align}
e^p_{N+1} &= Y_{N+1}-Y^p_{N+1}\\
&= X_{N+1}\beta + \epsilon_{N+1} - X_{N+1}\hat{\beta}_{\text{OLS}}\\
&= X_{N+1}\left( \beta-\hat{\beta}_{\text{OLS}} \right) + \epsilon_{N+1}
\end{align}
The strategy will be to sample (in a bootstrap kind of way) many times from $e^p_{N+1}$ and then calculate percentiles in the usual way. So, maybe we will sample 10,000 times from $e^p_{N+1}$, and then estimate the $5^{th}$ and $95^{th}$ percentiles as the $500^{th}$ and $9,500^{th}$ smallest members of the sample.
To draw on $X_{N+1}\left( \beta-\hat{\beta}_{\text{OLS}} \right)$, we can bootstrap the errors (resampling cases would be fine, too, but we are assuming iid errors anyway). So, on each bootstrap replication, you draw $N$ times with replacement from the variance-adjusted residuals (see the next paragraph) to get $\epsilon^*_i$, then form new $Y^*_i=X_i\hat{\beta}_{\text{OLS}}+\epsilon^*_i$, then run OLS on the new dataset, $\left(Y^*,X \right)$, to get this replication's $\beta^*_r$. Finally, this replication's draw on $X_{N+1}\left( \beta-\hat{\beta}_{\text{OLS}} \right)$ is $X_{N+1}\left( \hat{\beta}_{\text{OLS}}-\beta^*_r \right)$.
Given that we are assuming iid $\epsilon$, the natural way to sample from the $\epsilon_{N+1}$ part of the equation is to use the residuals from the regression, $\left\{ e_1,e_2,\ldots,e_N \right\}$. Residuals have variances that differ across observations and that are generally too small, so we will instead sample from the mean-centered, variance-corrected residuals, $\left\{ s_1-\overline{s},s_2-\overline{s},\ldots,s_N-\overline{s} \right\}$, where $s_i=e_i/\sqrt{1-h_i}$ and $h_i$ is the leverage of observation $i$.
And, finally, the algorithm for making a 90% prediction interval for $Y_{N+1}$, given $X_{N+1}$, is:
- Make the prediction $Y^p_{N+1}=X_{N+1}\hat{\beta}_{\text{OLS}}$.
- Make the mean-centered, variance-adjusted residuals, $\left\{ s_1-\overline{s},s_2-\overline{s},\ldots,s_N-\overline{s}\right\}$, where $s_i=e_i/\sqrt{1-h_{i}}$ and $h_i$ is the leverage of observation $i$.
- For replications $r=1,2,\ldots,R$:
  - Draw $N$ times with replacement on the adjusted residuals to make bootstrap residuals $\left\{\epsilon^*_1,\epsilon^*_2,\ldots,\epsilon^*_N \right\}$
  - Generate bootstrap $Y^*=X\hat{\beta}_{\text{OLS}}+\epsilon^*$
  - Calculate the bootstrap OLS estimator for this replication, $\beta^*_r=\left( X'X \right)^{-1}X'Y^*$
  - Obtain the bootstrap residuals from this replication, $e^*_r=Y^*-X\beta^*_r$
  - Calculate the bootstrap variance-adjusted residuals from this replication, $s^*_i-\overline{s^*}$ (the leverages are unchanged, since $X$ is unchanged)
  - Draw one of the bootstrap variance-adjusted residuals from this replication, $\epsilon^*_{N+1,r}$
  - Calculate this replication's draw on $e^p_{N+1}$: $e^{p*}_r=X_{N+1}\left( \hat{\beta}_{\text{OLS}}-\beta^*_r \right)+\epsilon^*_{N+1,r}$
- Find the $5^{th}$ and $95^{th}$ percentiles of $\left\{e^{p*}_1,\ldots,e^{p*}_R\right\}$; call them $e^5,e^{95}$
- The 90% prediction interval for $Y_{N+1}$ is $\left[Y^p_{N+1}+e^5,Y^p_{N+1}+e^{95} \right]$.
Here is R code:
# This script gives an example of the procedure to construct a prediction interval
# for a linear regression model using a bootstrap method. The method is the one
# described in Section 6.3.3 of Davison and Hinkley (1997),
# _Bootstrap Methods and Their Application_.
set.seed(12344321)
# Generate bivariate regression data
x <- runif(n=100,min=0,max=100)
y <- 1 + x + (rexp(n=100,rate=0.25)-4)
my.reg <- lm(y~x)
summary(my.reg)
# Predict y for x=78:
y.p <- coef(my.reg)["(Intercept)"] + coef(my.reg)["x"]*78
y.p
# Create adjusted residuals
leverage <- influence(my.reg)$hat
my.s.resid <- residuals(my.reg)/sqrt(1-leverage)
my.s.resid <- my.s.resid - mean(my.s.resid)
# One bootstrap replication: returns one draw on the prediction error e^p_{N+1}
the.replication <- function(reg,s,x_Np1=0){
# Make bootstrap residuals
ep.star <- sample(s,size=length(reg$residuals),replace=TRUE)
# Make bootstrap Y
y.star <- fitted(reg)+ep.star
# Do bootstrap regression
x <- model.frame(reg)[,2]
bs.reg <- lm(y.star~x)
# Create bootstrapped adjusted residuals
bs.lev <- influence(bs.reg)$hat
bs.s <- residuals(bs.reg)/sqrt(1-bs.lev)
bs.s <- bs.s - mean(bs.s)
# Calculate draw on prediction error
xb.xb <- coef(reg)["(Intercept)"] - coef(bs.reg)["(Intercept)"]
xb.xb <- xb.xb + (coef(reg)["x"] - coef(bs.reg)["x"])*x_Np1
return(unname(xb.xb + sample(bs.s,size=1)))
}
# Do bootstrap with 10,000 replications
ep.draws <- replicate(n=10000,the.replication(reg=my.reg,s=my.s.resid,x_Np1=78))
# Create prediction interval
y.p+quantile(ep.draws,probs=c(0.05,0.95))
# prediction interval using normal assumption
predict(my.reg,newdata=data.frame(x=78),interval="prediction",level=0.90)
# Quick and dirty Monte Carlo to see which prediction interval is better
# That is, what are the 5th and 95th percentiles of Y_{N+1}
#
# To do it properly, I guess we would want to do the whole procedure above
# 10,000 times and then see what percentage of the time each prediction
# interval covered Y_{N+1}
y.np1 <- 1 + 78 + (rexp(n=10000,rate=0.25)-4)
quantile(y.np1,probs=c(0.05,0.95))
Best Answer
You've got two options, I think.
Generate, say, 100 synthetic data sets by sampling with replacement from the original data set. Run the knn regression over each synthetic data set and sort the resulting point predictions at the $x$ of interest. The interval from the 5th-smallest to the 95th-smallest point prediction is then a bootstrap 90% confidence interval.
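A minimal sketch of this option in R, with a hand-rolled one-dimensional knn regressor and made-up data (all names and settings are my own):

```r
# knn regression: predict at x.new as the mean y of the k nearest training points.
knn.predict <- function(x.train, y.train, x.new, k = 5) {
  nn <- order(abs(x.train - x.new))[1:k]  # indices of the k nearest neighbors
  mean(y.train[nn])
}

set.seed(1)
x <- runif(200, 0, 10)
y <- sin(x) + rnorm(200, sd = 0.3)

# Resample the data 100 times, rerun knn each time, and take the 5th and
# 95th percentiles of the point predictions at x = 4:
boot.preds <- replicate(100, {
  idx <- sample(seq_along(x), replace = TRUE)
  knn.predict(x[idx], y[idx], x.new = 4, k = 5)
})
quantile(boot.preds, probs = c(0.05, 0.95))
```

Note that this interval reflects the variability of the fitted regression function, not the spread of new observations around it.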
Basically, you use either a pooled variance estimator (if you have multiple observations at the same $x$) or pseudo-residuals to get an estimate of the error variance. Assuming homoskedastic, normal errors, you can then use the $t$-distribution:
$ \bar y_i \pm t(h,\alpha) \frac{\hat\sigma}{\sqrt{n_i}} $
where $\bar y_i$ is the average prediction in the neighborhood, $h$ is the degrees of freedom of the $t$-distribution, and $n_i$ is the number of points in the neighborhood.
You can read more about it here