Solved – How to use the PRESS statistic for model selection

cross-validation, model-evaluation, regression

I am confused about how to use the PRESS statistic to compare models. I understand that the PRESS statistic is calculated by summing the squared residuals:

$$\text{PRESS} = \sum_{i=1}^n (y_i - \hat{y}_{i, -i})^2$$

where the residual is the difference between the observed and predicted value for the $i$-th data point, with the prediction coming from a model trained on the data with the $i$-th data point removed. My confusion lies in the fact that a new regression equation (hence a new model) is estimated each time a data point is dropped, so $n$ different models are trained in the process of calculating PRESS and the final PRESS statistic is not tied to a single model. In that case, how can you use the PRESS statistic to compare two different models? How do you calculate a PRESS statistic for a given regression model? I think I am making a basic mistake somewhere here, but I am not sure where my reasoning is off. Thanks for any help.

Best Answer

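You calculate PRESS for a model trained on $n$ values to get an idea of its out-of-sample performance, by leaving out one sample at a time. The statistic is therefore tied to a model specification (which predictors and which functional form), not to one particular set of fitted coefficients. So while you do end up with $n$ fitted models to determine the statistic, the model you eventually use is the original one trained on all $n$ values.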
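As a minimal sketch of that procedure, the snippet below computes PRESS for two candidate specifications by refitting each one $n$ times. The simulated data, the helper name `press_loo`, and the two design matrices are assumptions made for illustration, and ordinary least squares is assumed as the fitting method.

```python
import numpy as np

def press_loo(X, y):
    """PRESS via n explicit refits, each leaving out one observation."""
    n = len(y)
    total = 0.0
    for i in range(n):
        keep = np.arange(n) != i                      # drop the i-th row
        beta, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
        total += (y[i] - X[i] @ beta) ** 2            # squared LOO residual
    return total

# Hypothetical data: the true relationship is linear in x.
rng = np.random.default_rng(0)
n = 40
x = rng.uniform(-2, 2, size=n)
y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=n)

# Two candidate specifications, each with an intercept column.
X_linear = np.column_stack([np.ones(n), x])
X_poly5 = np.column_stack([x ** k for k in range(6)])

press_linear = press_loo(X_linear, y)
press_poly5 = press_loo(X_poly5, y)
print(press_linear, press_poly5)
```

The specification with the smaller PRESS is preferred, and that specification is then refit once on all $n$ points to obtain the model you actually use.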

Since you are only leaving out a single observation at a time (as in LOOCV), each of the $n$ leave-one-out fits differs only slightly from the fit on all $n$ values: the addition of the 'last' sample has minimal influence on the final model. Because of this you can safely use PRESS to compare models, even though the actual models you are comparing were not used to calculate it.
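If the models are fit by ordinary least squares (an assumption here, since the question does not say which regression is used), the link to the full model is exact. The leave-one-out residual equals $e_i / (1 - h_{ii})$, where $e_i$ is the ordinary residual and $h_{ii}$ is the leverage from the fit on all $n$ points, so PRESS can be computed from that single fit. The sketch below checks this against the explicit refits; the function names and the simulated data are illustrative.

```python
import numpy as np

def press_loo(X, y):
    """PRESS via n explicit leave-one-out refits."""
    n = len(y)
    total = 0.0
    for i in range(n):
        keep = np.arange(n) != i
        beta, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
        total += (y[i] - X[i] @ beta) ** 2
    return total

def press_hat(X, y):
    """PRESS from the single full-data OLS fit, using the leverages h_ii."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    # Diagonal of the hat matrix H = X (X'X)^{-1} X', obtained from the
    # reduced QR factorisation (assumes X has full column rank).
    Q, _ = np.linalg.qr(X)
    h = np.sum(Q ** 2, axis=1)
    return np.sum((resid / (1.0 - h)) ** 2)

# Hypothetical data for the comparison.
rng = np.random.default_rng(0)
n = 30
x = rng.normal(size=n)
y = 0.5 - 1.5 * x + rng.normal(scale=0.3, size=n)
X = np.column_stack([np.ones(n), x])

print(press_loo(X, y), press_hat(X, y))  # the two values agree up to rounding
```

This identity holds for linear least squares; for other model classes the explicit refits are generally needed.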

If you have a larger sample size, you could consider a form of nested cross-validation, comparing models with the inner cross-validation and evaluating the performance of the 'winning' model on the outer cross-validation loop.
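A rough sketch of that nested scheme follows, assuming polynomial degree as the thing being selected and plain $k$-fold splits in both loops. All function names, fold counts, and the simulated data are illustrative choices, not part of the answer above.

```python
import numpy as np

def design(x, degree):
    """Polynomial design matrix with columns 1, x, ..., x^degree."""
    return np.column_stack([x ** k for k in range(degree + 1)])

def fit(X, y):
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

def cv_sse(x, y, degree, folds):
    """Sum of squared out-of-fold prediction errors for one degree."""
    sse = 0.0
    for test_idx in folds:
        train = np.ones(len(y), dtype=bool)
        train[test_idx] = False
        beta = fit(design(x[train], degree), y[train])
        sse += np.sum((y[test_idx] - design(x[test_idx], degree) @ beta) ** 2)
    return sse

def nested_cv(x, y, degrees, n_outer=5, n_inner=5, seed=0):
    rng = np.random.default_rng(seed)
    outer_folds = np.array_split(rng.permutation(len(y)), n_outer)
    outer_sse, chosen = 0.0, []
    for test_idx in outer_folds:
        train = np.ones(len(y), dtype=bool)
        train[test_idx] = False
        x_tr, y_tr = x[train], y[train]
        # Inner loop: pick the degree using only the outer-training data.
        inner_folds = np.array_split(rng.permutation(len(y_tr)), n_inner)
        best = min(degrees, key=lambda d: cv_sse(x_tr, y_tr, d, inner_folds))
        chosen.append(best)
        # Outer loop: score the winner on data the selection never saw.
        beta = fit(design(x_tr, best), y_tr)
        outer_sse += np.sum((y[test_idx] - design(x[test_idx], best) @ beta) ** 2)
    return outer_sse / len(y), chosen

# Hypothetical data: linear truth, 200 observations.
rng = np.random.default_rng(1)
x = rng.uniform(-2, 2, size=200)
y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=200)

mse, chosen = nested_cv(x, y, degrees=[1, 2, 3, 5])
print(mse, chosen)
```

The outer mean squared error estimates the performance of the whole selection procedure, while `chosen` shows which degree won in each outer fold. Setting the inner fold count equal to the inner training size would turn the inner criterion into PRESS.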