Solved – Should I use the mean-squared-prediction-error from LOOCV for prediction intervals

cross-validation · prediction-interval · r

I have a question about which prediction variance to use to calculate prediction intervals from a fitted lm object in R.

For a certain multiple linear regression model I have obtained an error variance with leave-one-out cross-validation (LOOCV) by taking the mean of the squared differences between observed and predicted values (i.e., the mean squared prediction error). I am aware of some of the drawbacks of LOOCV (e.g., When are Shao's results on leave-one-out cross-validation applicable?), but for my specific application this was the easiest (and probably the only realistically implementable) CV method. The final linear model (fitted_lm) is fitted with all observations, and with this model I would like to make predictions for new observations (new_observations). For this I am using the predict.lm function in R:

predict(fitted_lm, new_observations, interval = "prediction", pred.var = ???)

My questions are:

  • What value do I use for pred.var (i.e., “the variance(s) for future observations to be assumed for prediction intervals”) in order to obtain realistic prediction intervals for my new_observations?
  • Do I use the error variance obtained from the LOOCV, or do I use the function’s default (i.e., “the default is to assume that future observations have the same error variance as those used for fitting”)?
  • Is the mean squared prediction error not appropriate in this case?
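For context, here is a minimal sketch of the setup, with made-up data (the variable names y, x1, x2 are hypothetical). For lm, the LOOCV mean squared prediction error has a closed form based on the hat values (the PRESS statistic divided by n), so no refitting loop is needed:

```r
## Toy data standing in for the real dataset (names are hypothetical).
set.seed(1)
d <- data.frame(x1 = rnorm(30), x2 = rnorm(30))
d$y <- 1 + 0.5 * d$x1 - 0.3 * d$x2 + rnorm(30, sd = 0.1)

fitted_lm <- lm(y ~ x1 + x2, data = d)

## LOOCV MSPE for a linear model via the hat-value shortcut
## (PRESS / n): each residual is inflated by 1 / (1 - leverage).
loocv_mspe <- mean((residuals(fitted_lm) / (1 - hatvalues(fitted_lm)))^2)

new_observations <- data.frame(x1 = 0.2, x2 = -0.1)

## One candidate answer to the question: pass the LOOCV estimate as the
## assumed variance of future observations.
predict(fitted_lm, new_observations, interval = "prediction",
        pred.var = loocv_mspe)
```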

Following up on Michael Chernick's answer below, I had a look at the Draper & Smith (1998) book ("Applied regression analysis. 3rd Edition"). In this book s² is defined as the "variance about the regression" (p 32). This is, I presume, what we describe below as the model estimate of residual variance. Furthermore, this book mentions:

"Since the actual observed value of Y varies about the true mean value with variance σ² [independent of V(Ŷ)], a predicted value of an individual observation will still be given by Ŷ but will have variance

σ² + V(Ŷ)

with corresponding estimated value obtained by inserting s² for σ²" (pp 81-82).

Thus, as far as I understand, in the D & S book only the model estimate of residual variance, s², is used to calculate prediction intervals. This would be the default setting in the predict function (function help: "the default is to assume that future observations have the same error variance as those used for fitting"). However, as fosgen states below, "although LOOCV mean squared prediction error is not equal to the real mean squared prediction error, it is much more close to real than error variance of fitted model".
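One can check on a toy model (hypothetical data below) that predict.lm's default prediction interval is exactly the Draper & Smith form, Var(pred) = s² + V(Ŷ), with s² plugged in for σ²:

```r
## Verify that the default prediction-interval half-width equals
## t * sqrt(s^2 + se.fit^2), i.e. the D&S variance s^2 + V(Yhat).
set.seed(2)
d <- data.frame(x = rnorm(20))
d$y <- 2 + 3 * d$x + rnorm(20)
fit <- lm(y ~ x, data = d)

newd <- data.frame(x = 0.5)
p <- predict(fit, newd, interval = "prediction", se.fit = TRUE)

s2 <- summary(fit)$sigma^2   # "variance about the regression"
half_width <- qt(0.975, df = fit$df.residual) * sqrt(s2 + p$se.fit^2)

## Matches the half-width reported by predict.lm
all.equal(unname(p$fit[, "upr"] - p$fit[, "fit"]), unname(half_width))
```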

To make this more concrete; in my dataset I get a model estimate of residual variance of 0.005998 and a LOOCV mean squared prediction error of 0.007293. What should I then fill in as pred.var in the predict.lm function:

  • Nothing (i.e. use the default, which equals the model estimate of residual variance)
  • 0.007293 (i.e. the LOOCV mean squared prediction error)
  • 0.005998 + 0.007293 (Michael Chernick: “The model estimate of residual variance gets added to the error variance due to estimating the parameters to get the prediction error variance for a new observation”).
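The three options above can be compared side by side on a toy model (hypothetical data; the variances 0.005998 and 0.007293 are the values quoted in the question), to see how much the choice of pred.var actually changes the interval width:

```r
## Interval widths under the three candidate pred.var choices.
set.seed(3)
d <- data.frame(x = runif(25))
d$y <- 1 + 2 * d$x + rnorm(25, sd = 0.08)
fit <- lm(y ~ x, data = d)
newd <- data.frame(x = 0.4)

opts <- c(default    = summary(fit)$sigma^2,  # option 1: model residual variance
          loocv      = 0.007293,              # option 2: LOOCV MSPE
          sum_of_two = 0.005998 + 0.007293)   # option 3: Chernick's reading

widths <- sapply(opts, function(v) {
  pi <- predict(fit, newd, interval = "prediction", pred.var = v)
  pi[, "upr"] - pi[, "lwr"]
})
widths
```

Because the interval width scales roughly with sqrt(pred.var), the practical difference between 0.005998 and 0.007293 is modest, while option 3 gives a noticeably wider interval.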

Best Answer

I would recommend using the LOOCV estimate, as the usual estimate of variance will be biased downward (because it has been directly minimised in fitting the model). The LOOCV estimate is a reasonable step towards eliminating this bias, and I have found it fairly useful for estimating the conditional variance in heteroscedastic regression problems, where the width of the prediction interval varies. In that setting the model is non-linear, so the bias can be substantial, and the variance is modelled, rather than merely estimated, so correcting the bias is quite important. For details, see

G. C. Cawley, N. L. C. Talbot, R. J. Foxall, S. R. Dorling and D. P. Mandic, Heteroscedastic kernel ridge regression, Neurocomputing, vol. 57, pp 105-124, March 2004. (pdf, doi, MATLAB demo)
