Solved – Prediction intervals for kNN regression

k nearest neighbour, prediction interval, regression

I would like to compute prediction intervals for predictions made by kNN regression. I can't find any explicit reference to confirm it, so my question is: is this approach to computing prediction intervals correct?

I have a reference dataset where each row is one location (e.g. city).
I have two features (say, x1 and x2), describing a sample from the population of that location (e.g. x1 could be the average income of the residents). Sample size is different for each location. I predict a target variable (say, y, e.g. the total number of cars in that city) based on x1 and x2.

A prediction for a new location Z is made by finding k nearest neighbors of Z in terms of x1 and x2 (the Euclidean distance), and averaging over the target variable of those k neighbors.

I compute prediction intervals as y* ± t·s, where y* is the kNN prediction, s is the standard deviation of the target among the k nearest neighbors, and t comes from the standard normal distribution (e.g. t = 1.96 for a 95% prediction interval). I ignore x1 and x2, and I ignore the fact that x1 and x2 are estimated over different samples.
Does the approach make sense?
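
For concreteness, the procedure described above can be sketched in a few lines of Python. This is only an illustration of the approach as stated in the question, not a validated method: the function name, the toy data, and the choice of the sample standard deviation (n - 1 denominator) are assumptions.

```python
import math
import statistics

def knn_predict_with_interval(X, y, z, k=3, t=1.96):
    """Return (prediction, lower, upper) for the query point z.

    X is a list of (x1, x2) tuples, y the matching list of targets.
    """
    # Euclidean distance from z to every reference location
    dists = [math.dist(row, z) for row in X]
    # Indices of the k closest reference locations
    nearest = sorted(range(len(X)), key=dists.__getitem__)[:k]
    neighbour_y = [y[i] for i in nearest]
    pred = statistics.fmean(neighbour_y)
    # Assumption: sample standard deviation (divides by k - 1)
    s = statistics.stdev(neighbour_y)
    return pred, pred - t * s, pred + t * s

# Toy data, made up purely for illustration
X = [(1.0, 2.0), (1.5, 1.8), (5.0, 8.0), (6.0, 9.0), (1.2, 2.1), (5.5, 8.5)]
y = [10.0, 12.0, 50.0, 55.0, 11.0, 52.0]
pred, lo, hi = knn_predict_with_interval(X, y, z=(1.1, 2.0), k=3)
print(pred, lo, hi)
```

With these toy numbers the three nearest neighbours have targets 10, 11 and 12, so the prediction is 11 and the interval is roughly 9.04 to 12.96.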

Best Answer

You've got two options, I think.

  1. Bootstrap

Generate 100 synthetic data-sets by sampling with replacement from the original data-set. Run the kNN regression over each new data-set and sort the resulting point predictions. The range from the 5th to the 95th sorted prediction is then an approximate 90% confidence interval for the prediction.
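
A minimal sketch of this bootstrap, using only the standard library, might look like the following. The helper names, the fixed seed, and the toy data are assumptions made for illustration; the resampling and the use of the 5th and 95th sorted predictions follow the description above.

```python
import math
import random
import statistics

def knn_predict(X, y, z, k):
    """Average the targets of the k nearest neighbours of z."""
    nearest = sorted(range(len(X)), key=lambda i: math.dist(X[i], z))[:k]
    return statistics.fmean([y[i] for i in nearest])

def bootstrap_knn_interval(X, y, z, k=3, n_boot=100,
                           lower_pct=5, upper_pct=95, seed=0):
    """Percentile interval from n_boot bootstrap kNN predictions."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(X)
    preds = []
    for _ in range(n_boot):
        # Resample rows with replacement, keeping (x, y) pairs together
        idx = [rng.randrange(n) for _ in range(n)]
        preds.append(knn_predict([X[i] for i in idx], [y[i] for i in idx], z, k))
    preds.sort()
    # 1-based ranks: with n_boot=100 these are the 5th and 95th predictions
    lo_rank = max(lower_pct * n_boot // 100, 1)
    hi_rank = max(upper_pct * n_boot // 100, 1)
    return preds[lo_rank - 1], preds[hi_rank - 1]

# Toy data, made up purely for illustration
X = [(1.0, 2.0), (1.5, 1.8), (5.0, 8.0), (6.0, 9.0), (1.2, 2.1), (5.5, 8.5)]
y = [10.0, 12.0, 50.0, 55.0, 11.0, 52.0]
lo, hi = bootstrap_knn_interval(X, y, z=(1.1, 2.0), k=3)
print(lo, hi)
```

On a data-set this small the interval is very coarse, since many resamples contain fewer than three distinct nearby points; the sketch is meant to show the mechanics only.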

  2. Pseudo-Residuals

Basically you either use a pooled variance estimator (if you have multiple observations at the same $x$) or pseudo-residuals to get an estimate of the variance. Assuming homoskedastic and normal error you can use the t-distribution such that:
$ \bar y_i \pm t(h,\alpha) \frac{\sigma}{\sqrt{n_i}}$
where $\bar y_i$ is the average prediction, $\sigma$ is the square root of the variance estimate, $h = \frac{n-2}{n}$ is the degrees of freedom of the t-distribution, and $n_i$ is the number of points in the neighborhood.
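
As a rough sketch of the pooled-variance route, the snippet below pools the variance over groups of repeated observations at the same $x$ and then forms the interval $\bar y_i \pm t \, \sigma / \sqrt{n_i}$. The function names and toy numbers are hypothetical, and the critical value `t_crit` is left as an input (the value 2.0 is just a placeholder) rather than computed from the t-distribution.

```python
import math
import statistics

def pooled_variance(groups):
    """Pooled variance over groups of repeated observations at the same x.

    Groups with a single observation carry no variance information
    and are skipped.
    """
    usable = [g for g in groups if len(g) > 1]
    num = sum((len(g) - 1) * statistics.variance(g) for g in usable)
    den = sum(len(g) - 1 for g in usable)
    return num / den

def mean_interval(neighbour_y, sigma, t_crit):
    """Interval centre +/- t_crit * sigma / sqrt(n_i) for one neighbourhood."""
    centre = statistics.fmean(neighbour_y)
    half_width = t_crit * sigma / math.sqrt(len(neighbour_y))
    return centre - half_width, centre + half_width

# Toy data, made up purely for illustration
groups = [[10.0, 12.0], [50.0, 54.0], [20.0, 21.0, 22.0]]
sigma = math.sqrt(pooled_variance(groups))
# t_crit=2.0 is a placeholder, not a value derived from the t-distribution
lo, hi = mean_interval([10.0, 11.0, 12.0], sigma, t_crit=2.0)
print(lo, hi)
```

Here the pooled variance is 3, so with three neighbours the half-width is 2 and the interval runs from about 9 to 13.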

You can read more about it here
