Solved – How to use the in-sample error estimator in regression

Tags: error, regression

In their book The Elements of Statistical Learning, Hastie, Tibshirani and Friedman discuss the in-sample error $Err_{in}$ (p. 229):
$$Err_{in} = \frac{1}{n}\sum_{i=1}^n\mathbb{E}\left[\left(Y_i-\hat{f}\left( x_i\right)\right)^2\right]
$$
In the case of linear regression there exists an unbiased estimator of this error:
$$\widehat{Err}_{in}=\text{training error} + 2\,\frac{d}{n}\,\hat{\sigma}_{\epsilon}^2$$
with $d$ the number of parameters (the dimension), $n$ the number of observations, and $\hat{\sigma}_{\epsilon}^2$ an unbiased estimate of the residual (noise) variance.
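
For concreteness, here is how I would compute each quantity for a single ordinary least-squares fit. This is only a minimal sketch assuming NumPy and scikit-learn; the toy data and variable names are my own.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Toy data: n observations, 3 predictors plus an intercept
n, n_predictors = 100, 3
X = rng.normal(size=(n, n_predictors))
y = 2.0 + X @ np.array([1.5, -0.7, 0.0]) + rng.normal(scale=1.0, size=n)

model = LinearRegression().fit(X, y)
residuals = y - model.predict(X)

d = n_predictors + 1                           # number of parameters, including the intercept
training_error = np.mean(residuals ** 2)       # average squared residual on the training data
sigma2_hat = np.sum(residuals ** 2) / (n - d)  # unbiased estimate of the residual variance

err_in_hat = training_error + 2 * d / n * sigma2_hat
print(f"estimated in-sample error: {err_in_hat:.3f}")
```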

My question is: how do you use this criterion?

Best Answer

The idea is that you can compare this "in-sample error" across different models.

So you fit a number of models (with more or fewer main effects and/or interactions) that you think might be of interest. Next you calculate the estimated in-sample error for each of these fitted models and select the one with the smallest value: this is supposedly the 'best' model (with respect to this criterion), as in the sketch below.
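
Here is a minimal sketch of that comparison, assuming NumPy and scikit-learn and some made-up candidate designs. The residual variance $\hat{\sigma}_{\epsilon}^2$ is estimated from the largest model (a low-bias fit); names like `err_in_hat` and the nested candidate sets are mine.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def err_in_hat(X, y, sigma2_hat):
    """Estimated in-sample error: training error + 2*d/n * sigma2_hat."""
    n, p = X.shape
    d = p + 1  # parameters, including the intercept
    model = LinearRegression().fit(X, y)
    training_error = np.mean((y - model.predict(X)) ** 2)
    return training_error + 2 * d / n * sigma2_hat

rng = np.random.default_rng(1)
n = 200
X_full = rng.normal(size=(n, 5))
y = 1.0 + 2.0 * X_full[:, 0] - 1.0 * X_full[:, 1] + rng.normal(size=n)

# Estimate the residual variance from the largest (low-bias) model
full = LinearRegression().fit(X_full, y)
rss_full = np.sum((y - full.predict(X_full)) ** 2)
sigma2_hat = rss_full / (n - (X_full.shape[1] + 1))

# Candidate models: nested subsets of the predictors
candidates = {f"first {k} predictors": X_full[:, :k] for k in range(1, 6)}
scores = {name: err_in_hat(X, y, sigma2_hat) for name, X in candidates.items()}

best = min(scores, key=scores.get)
print(scores)
print("selected model:", best)
```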

In practice, many people use cross-validation for model comparison instead of this in-sample error estimate, but different criteria may give different results (which may or may not be what you want).
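
For comparison, continuing the sketch above (re-using `candidates` and `y`), the same model comparison could be done with 10-fold cross-validation via scikit-learn's `cross_val_score`:

```python
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Mean squared error estimated by 10-fold cross-validation for each candidate
cv_scores = {
    name: -cross_val_score(LinearRegression(), X, y,
                           scoring="neg_mean_squared_error", cv=10).mean()
    for name, X in candidates.items()
}
print("CV choice:", min(cv_scores, key=cv_scores.get))
```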