Is there any bootstrap technique available to compute prediction intervals for point predictions obtained e.g. from linear regression or other regression method (k-nearest neighbour, regression trees etc.)?
Somehow I feel that the sometimes proposed approach of simply bootstrapping the point prediction (see e.g. Prediction intervals for kNN regression) yields a confidence interval, not a prediction interval.
An example in R
# STEP 1: GENERATE DATA
set.seed(34345)
n <- 100
x <- runif(n)
y <- 1 + 0.2*x + rnorm(n)
data <- data.frame(x, y)
# STEP 2: COMPUTE CLASSIC 95%-PREDICTION INTERVAL
fit <- lm(y ~ x)
plot(fit) # not shown but looks fine with respect to all relevant aspects
# Classic prediction interval based on standard error of forecast
predict(fit, list(x = 0.1), interval = "p")
# -0.6588168 3.093755
# Classic confidence interval based on standard error of estimation
predict(fit, list(x = 0.1), interval = "c")
# 0.893388 1.54155
# STEP 3: NOW BY BOOTSTRAP
B <- 1000
pred <- numeric(B)
for (i in 1:B) {
  boot <- sample(n, n, replace = TRUE)
  fit.b <- lm(y ~ x, data = data[boot, ])
  pred[i] <- predict(fit.b, list(x = 0.1))
}
quantile(pred, c(0.025, 0.975))
# 0.8699302 1.5399179
Obviously, the 95% percentile bootstrap interval matches the 95% confidence interval, not the 95% prediction interval. So my question: How to do it properly?
Best Answer
The method laid out below is the one described in Section 6.3.3 of Davison and Hinkley (1997), Bootstrap Methods and Their Application. Thanks to Glen_b and his comment here. Given that there were several questions on Cross Validated on this topic, I thought it was worth writing up.
The linear regression model is: \begin{align} Y_i &= X_i\beta+\epsilon_i \end{align}
We have data, $i=1,2,\ldots,N$, which we use to estimate the $\beta$ as: \begin{align} \hat{\beta}_{\text{OLS}} &= \left( X'X \right)^{-1}X'Y \end{align}
Now, we want to predict what $Y$ will be for a new data point, given that we know $X$ for it. This is the prediction problem. Let's call the new $X$ (which we know) $X_{N+1}$ and the new $Y$ (which we would like to predict), $Y_{N+1}$. The usual prediction (if we are assuming that the $\epsilon_i$ are iid and uncorrelated with $X$) is: \begin{align} Y^p_{N+1} &= X_{N+1}\hat{\beta}_{\text{OLS}} \end{align}
The forecast error made by this prediction is: \begin{align} e^p_{N+1} &= Y_{N+1}-Y^p_{N+1} \end{align}
We can re-write this equation like: \begin{align} Y_{N+1} &= Y^p_{N+1} + e^p_{N+1} \end{align}
Now, $Y^p_{N+1}$ we have already calculated. So, if we want to bound $Y_{N+1}$ in an interval, say, 90% of the time, all we need to do is estimate consistently the $5^{th}$ and $95^{th}$ percentiles/quantiles of $e^p_{N+1}$, call them $e^5,e^{95}$, and the prediction interval will be $\left[Y^p_{N+1}+e^5,Y^p_{N+1}+e^{95} \right]$.
How to estimate the quantiles/percentiles of $e^p_{N+1}$? Well, we can write: \begin{align} e^p_{N+1} &= Y_{N+1}-Y^p_{N+1}\\ &= X_{N+1}\beta + \epsilon_{N+1} - X_{N+1}\hat{\beta}_{\text{OLS}}\\ &= X_{N+1}\left( \beta-\hat{\beta}_{\text{OLS}} \right) + \epsilon_{N+1} \end{align}
The strategy will be to sample (in a bootstrap kind of way) many times from $e^p_{N+1}$ and then calculate percentiles in the usual way. So, maybe we will sample 10,000 times from $e^p_{N+1}$, and then estimate the $5^{th}$ and $95^{th}$ percentiles as the $500^{th}$ and $9,500^{th}$ smallest members of the sample.
To draw on $X_{N+1}\left( \beta-\hat{\beta}_{\text{OLS}} \right)$, we can bootstrap the errors (resampling cases would be fine, too, but we are assuming iid errors anyway). So, on each bootstrap replication, you draw $N$ times with replacement from the variance-adjusted residuals (see the next paragraph) to get $\epsilon^*_i$, form the new responses $Y^*_i=X_i\hat{\beta}_{\text{OLS}}+\epsilon^*_i$, then run OLS on the new dataset, $\left(Y^*,X \right)$, to get this replication's $\beta^*_r$. This replication's draw on $X_{N+1}\left( \beta-\hat{\beta}_{\text{OLS}} \right)$ is then $X_{N+1}\left( \hat{\beta}_{\text{OLS}}-\beta^*_r \right)$.
Given we are assuming iid $\epsilon$, the natural way to sample from the $\epsilon_{N+1}$ part of the equation is to use the residuals from the regression, $\left\{ e_1,e_2,\ldots,e_N \right\}$. Raw residuals have unequal and generally too-small variances, so we will instead sample from $\left\{ s_1-\overline{s},s_2-\overline{s},\ldots,s_N-\overline{s} \right\}$, the centered, variance-corrected residuals, where $s_i=e_i/\sqrt{1-h_i}$ and $h_i$ is the leverage of observation $i$.
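As a quick sketch in R (refitting the model from the question; `hatvalues()` returns the leverages $h_i$):

```r
# Refit the model from the question
set.seed(34345)
n <- 100
x <- runif(n)
y <- 1 + 0.2 * x + rnorm(n)
fit <- lm(y ~ x)

h <- hatvalues(fit)                  # leverages h_i
s <- residuals(fit) / sqrt(1 - h)    # variance-corrected residuals s_i
s.centered <- s - mean(s)            # centered residuals s_i - s-bar
```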
And, finally, the algorithm for making a 90% prediction interval for $Y_{N+1}$, given that $X$ is $X_{N+1}$, is:

1. Form the prediction $Y^p_{N+1}=X_{N+1}\hat{\beta}_{\text{OLS}}$.
2. Form the centered, variance-corrected residuals $\left\{ s_1-\overline{s},\ldots,s_N-\overline{s} \right\}$ described above.
3. For each replication $r=1,2,\ldots,R$: draw $N$ times with replacement from the corrected residuals to get $\epsilon^*_i$, form $Y^*_i=X_i\hat{\beta}_{\text{OLS}}+\epsilon^*_i$, and run OLS on $\left(Y^*,X\right)$ to get $\beta^*_r$; draw one more corrected residual, $\epsilon^*_{N+1,r}$; and form this replication's draw of the forecast error, $e^{p*}_r = X_{N+1}\left( \hat{\beta}_{\text{OLS}}-\beta^*_r \right) + \epsilon^*_{N+1,r}$.
4. Find the $5^{th}$ and $95^{th}$ percentiles of $\left\{ e^{p*}_1,\ldots,e^{p*}_R \right\}$, call them $e^5$ and $e^{95}$.
5. The 90% prediction interval is $\left[Y^p_{N+1}+e^5,\,Y^p_{N+1}+e^{95} \right]$.

Here is R code:
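A sketch of the full procedure, continuing the simulated data from the question (the number of replications `R` and the quantile levels are choices; the derivation above used 5%/95% for a 90% interval, while the question's classic interval is a 95% one, so 2.5%/97.5% is used here for comparability):

```r
# Residual bootstrap for a prediction interval (iid errors assumed)
set.seed(34345)
n <- 100
x <- runif(n)
y <- 1 + 0.2 * x + rnorm(n)
fit <- lm(y ~ x)

x.new <- 0.1
y.pred <- unname(predict(fit, newdata = data.frame(x = x.new)))  # Y^p_{N+1}

# Centered, variance-corrected residuals s_i - s-bar
h <- hatvalues(fit)
s <- residuals(fit) / sqrt(1 - h)
s <- s - mean(s)

R <- 10000
err <- numeric(R)  # bootstrap draws of the forecast error e^p_{N+1}
for (r in 1:R) {
  eps.star <- sample(s, n, replace = TRUE)    # resampled errors
  y.star <- fitted(fit) + eps.star            # new responses Y*_i
  fit.star <- lm(y.star ~ x)                  # this replication's beta*_r
  pred.star <- unname(predict(fit.star, newdata = data.frame(x = x.new)))
  eps.new <- sample(s, 1)                     # draw for epsilon_{N+1}
  err[r] <- (y.pred - pred.star) + eps.new    # X_{N+1}(beta-hat - beta*_r) + eps*
}

# 95% prediction interval [Y^p + e^{2.5}, Y^p + e^{97.5}]
pi.boot <- y.pred + quantile(err, c(0.025, 0.975))
pi.boot
```

On this simulated data the resulting interval should be close to the classic one from `predict(fit, interval = "p")` and far wider than the bootstrap confidence interval computed in the question.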