Solved – Bootstrap prediction interval


Is there a bootstrap technique available to compute prediction intervals for point predictions obtained, e.g., from linear regression or another regression method (k-nearest neighbours, regression trees, etc.)?

Somehow I feel that the sometimes-proposed approach of simply bootstrapping the point prediction (see e.g. Prediction intervals for kNN regression) yields a confidence interval rather than a prediction interval.

An example in R:

# STEP 1: GENERATE DATA

set.seed(34345)

n <- 100 
x <- runif(n)
y <- 1 + 0.2*x + rnorm(n)
data <- data.frame(x, y)


# STEP 2: COMPUTE CLASSIC 95%-PREDICTION INTERVAL
fit <- lm(y ~ x)
plot(fit) # not shown but looks fine with respect to all relevant aspects

# Classic prediction interval based on standard error of forecast
predict(fit, list(x = 0.1), interval = "p")
# -0.6588168 3.093755

# Classic confidence interval based on standard error of estimation
predict(fit, list(x = 0.1), interval = "c")
# 0.893388 1.54155


# STEP 3: NOW BY BOOTSTRAP
B <- 1000
pred <- numeric(B)
for (i in 1:B) {
  boot <- sample(n, n, replace = TRUE)        # resample rows (cases) with replacement
  fit.b <- lm(y ~ x, data = data[boot, ])     # refit on the bootstrap sample
  pred[i] <- predict(fit.b, list(x = 0.1))    # point prediction at x = 0.1
}
quantile(pred, c(0.025, 0.975))
# 0.8699302 1.5399179

Obviously, the 95% bootstrap percentile interval matches the 95% confidence interval, not the 95% prediction interval. So my question: how can this be done properly?

Best Answer

The method laid out below is the one described in Section 6.3.3 of Davison and Hinkley (1997), Bootstrap Methods and Their Application. Thanks to Glen_b and his comment here. Given that there have been several questions on Cross Validated on this topic, I thought it was worth writing up.

The linear regression model is: \begin{align} Y_i &= X_i\beta+\epsilon_i \end{align}

We have data, $i=1,2,\ldots,N$, which we use to estimate $\beta$: \begin{align} \hat{\beta}_{\text{OLS}} &= \left( X'X \right)^{-1}X'Y \end{align}
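As a quick numerical aside (not part of the original answer), this estimator can be verified directly from the normal equations; a minimal sketch, reusing the x and y simulated in the question, where X and beta.hat are assumed names:

# Sketch: OLS via the normal equations
X <- cbind(1, x)                                  # add a column of ones for the intercept
beta.hat <- solve(crossprod(X), crossprod(X, y))  # (X'X)^{-1} X'Y
# beta.hat agrees with coef(lm(y ~ x))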

Now, we want to predict what $Y$ will be for a new data point, given that we know $X$ for it. This is the prediction problem. Let's call the new $X$ (which we know) $X_{N+1}$ and the new $Y$ (which we would like to predict), $Y_{N+1}$. The usual prediction (if we are assuming that the $\epsilon_i$ are iid and uncorrelated with $X$) is: \begin{align} Y^p_{N+1} &= X_{N+1}\hat{\beta}_{\text{OLS}} \end{align}

The forecast error made by this prediction is: \begin{align} e^p_{N+1} &= Y_{N+1}-Y^p_{N+1} \end{align}

We can rewrite this equation as: \begin{align} Y_{N+1} &= Y^p_{N+1} + e^p_{N+1} \end{align}

Now, $Y^p_{N+1}$ we have already calculated. So, if we want to bound $Y_{N+1}$ in an interval, say, 90% of the time, all we need to do is estimate consistently the $5^{th}$ and $95^{th}$ percentiles/quantiles of $e^p_{N+1}$, call them $e^5,e^{95}$, and the prediction interval will be $\left[Y^p_{N+1}+e^5,Y^p_{N+1}+e^{95} \right]$.

How to estimate the quantiles/percentiles of $e^p_{N+1}$? Well, we can write: \begin{align} e^p_{N+1} &= Y_{N+1}-Y^p_{N+1}\\ &= X_{N+1}\beta + \epsilon_{N+1} - X_{N+1}\hat{\beta}_{\text{OLS}}\\ &= X_{N+1}\left( \beta-\hat{\beta}_{\text{OLS}} \right) + \epsilon_{N+1} \end{align}

The strategy will be to sample (in a bootstrap kind of way) many times from the distribution of $e^p_{N+1}$ and then calculate percentiles in the usual way. So, maybe we will sample 10,000 times from $e^p_{N+1}$, and then estimate the $5^{th}$ and $95^{th}$ percentiles as the $500^{th}$ and $9{,}500^{th}$ smallest members of the sample.
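Concretely, if e.star were a vector holding those 10,000 draws (an assumed name, purely for illustration), the percentile estimates would be:

# Sketch: order statistics of an assumed vector e.star of 10,000 draws
e5  <- sort(e.star)[500]    # 5th percentile
e95 <- sort(e.star)[9500]   # 95th percentile
# equivalently: quantile(e.star, probs = c(0.05, 0.95))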

To draw on $X_{N+1}\left( \beta-\hat{\beta}_{\text{OLS}} \right)$, we can bootstrap the errors (bootstrapping cases would be fine too, but we are assuming iid errors anyway). So, on each bootstrap replication, draw $N$ times with replacement from the variance-adjusted residuals (see the next paragraph) to get $\epsilon^*_i$, form new $Y^*_i=X_i\hat{\beta}_{\text{OLS}}+\epsilon^*_i$, and run OLS on the new dataset, $\left(Y^*,X \right)$, to get this replication's $\beta^*_r$. Finally, this replication's draw on $X_{N+1}\left( \beta-\hat{\beta}_{\text{OLS}} \right)$ is $X_{N+1}\left( \hat{\beta}_{\text{OLS}}-\beta^*_r \right)$.

Given that we are assuming iid $\epsilon$, the natural way to sample from the $\epsilon_{N+1}$ part of the equation is to use the residuals from the regression, $\left\{ e_1,e_2,\ldots,e_N \right\}$. Raw residuals have unequal and generally too-small variances, so we sample instead from $\left\{ s_1-\overline{s},s_2-\overline{s},\ldots,s_N-\overline{s} \right\}$, the centred, variance-corrected residuals, where $s_i=e_i/\sqrt{1-h_i}$ and $h_i$ is the leverage of observation $i$.
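These adjusted residuals are straightforward to compute; a minimal sketch, assuming fit is the lm object from the question's example:

h <- hatvalues(fit)               # leverages h_i
s <- residuals(fit)/sqrt(1 - h)   # variance-corrected residuals s_i
s <- s - mean(s)                  # centred: s_i - mean(s)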

And, finally, the algorithm for making a 90% prediction interval for $Y_{N+1}$, given $X=X_{N+1}$, is:

  1. Make the prediction $Y^p_{N+1}=X_{N+1}\hat{\beta}_{\text{OLS}}$.
  2. Make the variance-adjusted residuals, $\left\{ s_1-\overline{s},s_2-\overline{s},\ldots,s_N-\overline{s}\right\}$, where $s_i=e_i/\sqrt{1-h_i}$.
  3. For replications $r=1,2,\ldots,R$:
    • Draw $N$ times on the adjusted residuals to make bootstrap residuals $\left\{\epsilon^*_1,\epsilon^*_2,\ldots,\epsilon^*_N \right\}$
    • Generate bootstrap $Y^*=X\hat{\beta}_{\text{OLS}}+\epsilon^*$
    • Calculate bootstrap OLS estimator for this replication, $\beta^*_r=\left( X'X \right)^{-1}X'Y^*$
    • Obtain bootstrap residuals from this replication, $e^*_r=Y^*-X\beta^*_r$
    • Calculate bootstrap variance-adjusted residuals from this replication, $s^*-\overline{s^*}$
    • Draw one of the bootstrap variance-adjusted residuals from this replication, $\epsilon^*_{N+1,r}$
    • Calculate this replication's draw on $e^p_{N+1}$, $e^{p*}_r=X_{N+1}\left( \hat{\beta}_{\text{OLS}}-\beta^*_r \right)+\epsilon^*_{N+1,r}$
  4. Find the $5^{th}$ and $95^{th}$ percentiles of the draws $\left\{e^{p*}_1,e^{p*}_2,\ldots,e^{p*}_R\right\}$; call them $e^5,e^{95}$.
  5. 90% prediction interval for $Y_{N+1}$ is $\left[Y^p_{N+1}+e^5,Y^p_{N+1}+e^{95} \right]$.

Here is R code:

# This script gives an example of the procedure to construct a prediction interval
# for a linear regression model using a bootstrap method.  The method is the one
# described in Section 6.3.3 of Davison and Hinkley (1997),
# _Bootstrap Methods and Their Application_.


set.seed(12344321)

# Generate bivariate regression data
x <- runif(n=100,min=0,max=100)
y <- 1 + x + (rexp(n=100,rate=0.25)-4)

my.reg <- lm(y~x)
summary(my.reg)

# Predict y for x=78:
y.p <- coef(my.reg)["(Intercept)"] + coef(my.reg)["x"]*78
y.p

# Create adjusted residuals
leverage <- influence(my.reg)$hat
my.s.resid <- residuals(my.reg)/sqrt(1-leverage)
my.s.resid <- my.s.resid - mean(my.s.resid)


the.replication <- function(reg,s,x_Np1=0){
  # Make bootstrap residuals
  ep.star <- sample(s,size=length(reg$residuals),replace=TRUE)

  # Make bootstrap Y
  y.star <- fitted(reg)+ep.star

  # Do bootstrap regression
  x <- model.frame(reg)[,2]
  bs.reg <- lm(y.star~x)

  # Create bootstrapped adjusted residuals
  bs.lev <- influence(bs.reg)$hat
  bs.s   <- residuals(bs.reg)/sqrt(1-bs.lev)
  bs.s   <- bs.s - mean(bs.s)

  # Calculate draw on prediction error
  # (use reg, the fitted model passed in, rather than the global my.reg)
  xb.xb <- coef(reg)["(Intercept)"] - coef(bs.reg)["(Intercept)"]
  xb.xb <- xb.xb + (coef(reg)["x"] - coef(bs.reg)["x"])*x_Np1
  return(unname(xb.xb + sample(bs.s,size=1)))
}

# Do bootstrap with 10,000 replications
ep.draws <- replicate(n=10000,the.replication(reg=my.reg,s=my.s.resid,x_Np1=78))

# Create prediction interval
y.p+quantile(ep.draws,probs=c(0.05,0.95))

# prediction interval using normal assumption
predict(my.reg,newdata=data.frame(x=78),interval="prediction",level=0.90)


# Quick and dirty Monte Carlo to see which prediction interval is better
# That is, what are the 5th and 95th percentiles of Y_{N+1}
# 
# To do it properly, I guess we would want to do the whole procedure above
# 10,000 times and then see what percentage of the time each prediction 
# interval covered Y_{N+1}

y.np1 <- 1 + 78 + (rexp(n=10000,rate=0.25)-4)
quantile(y.np1,probs=c(0.05,0.95))
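
# For completeness, here is a rough sketch of that fuller coverage check;
# the simulation and replication counts are arbitrary, and it reuses
# the.replication() from above. Repeat the whole procedure many times and
# record how often each 90% interval covers a fresh Y_{N+1} at x = 78.
n.sims <- 200
covered.boot <- covered.norm <- logical(n.sims)
for (k in seq_len(n.sims)) {
  x <- runif(n=100, min=0, max=100)
  y <- 1 + x + (rexp(n=100, rate=0.25) - 4)
  fit <- lm(y ~ x)
  y.new <- 1 + 78 + (rexp(n=1, rate=0.25) - 4)   # fresh Y_{N+1}

  # bootstrap 90% prediction interval
  lev <- influence(fit)$hat
  s <- residuals(fit)/sqrt(1 - lev)
  s <- s - mean(s)
  draws <- replicate(n=2000, the.replication(reg=fit, s=s, x_Np1=78))
  y.p <- predict(fit, newdata=data.frame(x=78))
  pi.boot <- y.p + quantile(draws, probs=c(0.05, 0.95))
  covered.boot[k] <- pi.boot[1] <= y.new && y.new <= pi.boot[2]

  # normal-theory 90% prediction interval
  pi.norm <- predict(fit, newdata=data.frame(x=78),
                     interval="prediction", level=0.90)
  covered.norm[k] <- pi.norm[1,"lwr"] <= y.new && y.new <= pi.norm[1,"upr"]
}
c(bootstrap=mean(covered.boot), normal=mean(covered.norm))  # empirical coverage rates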