Solved – Should I use the mean-squared-prediction-error from LOOCV for prediction intervals

cross-validation · prediction-interval · r

I have a question about which prediction variance to use to calculate prediction intervals from a fitted lm object in R.

For a certain multiple linear regression model I have obtained an error variance with leave-one-out cross-validation (LOOCV) by taking the mean of the squared differences between observed and predicted values (i.e., the mean squared prediction error). I am aware of some of the drawbacks of LOOCV (e.g., When are Shao's results on leave-one-out cross-validation applicable?), but for my specific application this was the easiest (and probably the only realistically implementable) CV method. The final linear model (fitted_lm) is fitted with all observations, and with this model I would like to make predictions for new observations (new_observations). For this I am using the predict.lm function in R:

predict(fitted_lm, new_observations, interval = "prediction", pred.var = ???)

My questions are:

  • What value do I use for pred.var (i.e., “the variance(s) for future observations to be assumed for prediction intervals”) in order to obtain realistic prediction intervals for my new_observations?
  • Do I use the error variance obtained from the LOOCV, or do I use the function’s default (i.e., “the default is to assume that future observations have the same error variance as those used for fitting”)?
  • Is the mean squared prediction error not appropriate in this case?
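For context, here is a minimal sketch of the setup, with made-up data (the variable names y, x1, x2 are hypothetical). For lm, the LOOCV mean squared prediction error has a closed form based on the hat values (the PRESS statistic divided by n), so no refitting loop is needed:

```r
## Toy data standing in for the real dataset (names are hypothetical).
set.seed(1)
d <- data.frame(x1 = rnorm(30), x2 = rnorm(30))
d$y <- 1 + 0.5 * d$x1 - 0.3 * d$x2 + rnorm(30, sd = 0.1)

fitted_lm <- lm(y ~ x1 + x2, data = d)

## LOOCV MSPE for a linear model via the hat-value shortcut
## (PRESS / n): each residual is inflated by 1 / (1 - leverage).
loocv_mspe <- mean((residuals(fitted_lm) / (1 - hatvalues(fitted_lm)))^2)

new_observations <- data.frame(x1 = 0.2, x2 = -0.1)

## One candidate answer to the question: pass the LOOCV estimate as the
## assumed variance of future observations.
predict(fitted_lm, new_observations, interval = "prediction",
        pred.var = loocv_mspe)
```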

Following up on Michael Chernick's answer below, I had a look at the Draper & Smith (1998) book ("Applied regression analysis. 3rd Edition"). In this book s² is defined as the "variance about the regression" (p 32). This is, I presume, what we describe below as the model estimate of residual variance. Furthermore, this book mentions:

"Since the actual observed value of Y varies about the true mean value with variance σ² [independent of V(Ŷ)], a predicted value of an individual observation will still be given by Ŷ but will have variance

σ² + V(Ŷ)

with corresponding estimated value obtained by inserting s² for σ²" (pp 81-82).

Thus, as far as I understand, in the D & S book only the model estimate of residual variance, s², is used to calculate prediction intervals. This would be the default setting in the predict function (function help: "the default is to assume that future observations have the same error variance as those used for fitting"). However, as fosgen states below, "although LOOCV mean squared prediction error is not equal to the real mean squared prediction error, it is much more close to real than error variance of fitted model".
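One can check on a toy model (hypothetical data below) that predict.lm's default prediction interval is exactly the Draper & Smith form, Var(pred) = s² + V(Ŷ), with s² plugged in for σ²:

```r
## Verify that the default prediction-interval half-width equals
## t * sqrt(s^2 + se.fit^2), i.e. the D&S variance s^2 + V(Yhat).
set.seed(2)
d <- data.frame(x = rnorm(20))
d$y <- 2 + 3 * d$x + rnorm(20)
fit <- lm(y ~ x, data = d)

newd <- data.frame(x = 0.5)
p <- predict(fit, newd, interval = "prediction", se.fit = TRUE)

s2 <- summary(fit)$sigma^2   # "variance about the regression"
half_width <- qt(0.975, df = fit$df.residual) * sqrt(s2 + p$se.fit^2)

## Matches the half-width reported by predict.lm
all.equal(unname(p$fit[, "upr"] - p$fit[, "fit"]), unname(half_width))
```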

To make this more concrete; in my dataset I get a model estimate of residual variance of 0.005998 and a LOOCV mean squared prediction error of 0.007293. What should I then fill in as pred.var in the predict.lm function:

  • Nothing (i.e. use the default, which equals the model estimate of residual variance)
  • 0.007293 (i.e. the LOOCV mean squared prediction error)
  • 0.005998 + 0.007293 (Michael Chernick: “The model estimate of residual variance gets added to the error variance due to estimating the parameters to get the prediction error variance for a new observation”).
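The three options above can be compared side by side on a toy model (hypothetical data; the variances 0.005998 and 0.007293 are the values quoted in the question), to see how much the choice of pred.var actually changes the interval width:

```r
## Interval widths under the three candidate pred.var choices.
set.seed(3)
d <- data.frame(x = runif(25))
d$y <- 1 + 2 * d$x + rnorm(25, sd = 0.08)
fit <- lm(y ~ x, data = d)
newd <- data.frame(x = 0.4)

opts <- c(default    = summary(fit)$sigma^2,  # option 1: model residual variance
          loocv      = 0.007293,              # option 2: LOOCV MSPE
          sum_of_two = 0.005998 + 0.007293)   # option 3: Chernick's reading

widths <- sapply(opts, function(v) {
  pi <- predict(fit, newd, interval = "prediction", pred.var = v)
  pi[, "upr"] - pi[, "lwr"]
})
widths
```

Because the interval width scales roughly with sqrt(pred.var), the practical difference between 0.005998 and 0.007293 is modest, while option 3 gives a noticeably wider interval.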

Best Answer

I would recommend using the LOOCV estimate, as the usual estimate of variance will be biased downward (because it has been directly minimised in fitting the model). The LOOCV estimate is a reasonable step towards eliminating this bias, and I have found it fairly useful for estimating the conditional variance in heteroscedastic regression problems, where the width of the prediction interval varies. In that setting the model is non-linear, so the bias can be substantial, and the variance is modelled, rather than merely estimated, so correcting the bias is quite important. For details, see

G. C. Cawley, N. L. C. Talbot, R. J. Foxall, S. R. Dorling and D. P. Mandic, Heteroscedastic kernel ridge regression, Neurocomputing, vol. 57, pp 105-124, March 2004. (pdf, doi, MATLAB demo)
