Solved – Difference between model fitting and cross validation


I see these concepts quite often and want to see if I have the right intuitive understanding.

Model fitting is when I have a set of data and fit a model (e.g. linear regression) as 'close' to the data as possible based on some loss function (e.g. squared loss). This can result in overfitting, since a higher-order polynomial model will never have a larger SSE (and usually has a smaller one) than a lower-order model.

Cross validation tests the predictive ability of different models by splitting the data into training and testing sets, and this helps check for overfitting.

For instance, if I fit a second-order polynomial to linear data, I will get a lower SSE but probably not a lower prediction error, so between the two I should choose the linear model. A different example: if I am fitting a k-nearest-neighbors model, then for each value of k (up to a reasonable number) I would fit the model as close to the training data as possible, compare the prediction error on the testing data across the different values of k, and pick the value with the lowest prediction error. For this value of k, I would fit the model on the entire dataset.
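To make the k-nearest-neighbors example concrete, here is roughly the procedure I have in mind (a minimal sketch assuming scikit-learn and synthetic data; the names are illustrative only):

    from sklearn.datasets import make_regression
    from sklearn.model_selection import cross_val_score
    from sklearn.neighbors import KNeighborsRegressor

    # Synthetic data standing in for my real dataset
    X, y = make_regression(n_samples=200, n_features=3, noise=10.0, random_state=0)

    # Compare candidate values of k by cross-validated prediction error (MSE)
    cv_mse = {}
    for k in range(1, 21):
        scores = cross_val_score(
            KNeighborsRegressor(n_neighbors=k), X, y,
            scoring="neg_mean_squared_error", cv=10,
        )
        cv_mse[k] = -scores.mean()

    best_k = min(cv_mse, key=cv_mse.get)

    # Refit the chosen model on the entire dataset
    final_model = KNeighborsRegressor(n_neighbors=best_k).fit(X, y)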

Do I have the right idea?

Best Answer

Yes, your understanding is correct.

Cross validation tests the predictive ability of different models by splitting the data into training and testing sets,

Yes.

and this helps check for overfitting.

Model selection or hyperparameter tuning is one purpose for which the CV estimate of predictive performance can be used. It is IMHO important not to confuse CV itself with the purpose to which its results are put.

In the first place, cross validation yields an approximation of the generalization error (the expected predictive performance of a model on unseen data).

This estimate can either be used as

  • an approximation of the generalization error of the model fitted on the whole data set with the same (hyper)parameter determination as was used for the CV surrogate models,
  • or to select hyperparameters. If you do this, the CV estimate becomes part of model training, and you need an independent estimate of generalization error; see e.g. nested aka double cross validation (sketched below).
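One way to set this up (a minimal sketch assuming scikit-learn; any CV implementation works analogously) is to let an inner CV loop pick the hyperparameter while an outer CV loop provides the independent generalization-error estimate:

    from sklearn.datasets import make_regression
    from sklearn.model_selection import GridSearchCV, cross_val_score
    from sklearn.neighbors import KNeighborsRegressor

    X, y = make_regression(n_samples=200, n_features=3, noise=10.0, random_state=0)

    # Inner CV: hyperparameter selection (this is part of model training)
    inner = GridSearchCV(
        KNeighborsRegressor(),
        param_grid={"n_neighbors": list(range(1, 21))},
        scoring="neg_mean_squared_error",
        cv=5,
    )

    # Outer CV: independent estimate of the generalization error of the whole
    # training procedure, including the hyperparameter selection
    outer_scores = cross_val_score(inner, X, y, scoring="neg_mean_squared_error", cv=5)
    print("nested CV estimate of MSE:", -outer_scores.mean())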

As for overfitting within the model training procedure, CV helps but cannot work miracles. Keep in mind that cross validation results are also subject to variance (from various sources). Thus, with an increasing number of models/hyperparameter settings in the comparison, there is also an increased risk of accidentally (due to variance of the CV estimates) observing very good prediction and being misled by this (see the one-standard-error rule, sometimes called the one-standard-deviation rule, for a heuristic against this).
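A small sketch of that heuristic (again assuming scikit-learn and synthetic data as above; the names are illustrative): among all candidates whose CV error lies within one standard error of the minimum, pick the least complex one (for k-NN, the largest k).

    import numpy as np
    from sklearn.datasets import make_regression
    from sklearn.model_selection import cross_val_score
    from sklearn.neighbors import KNeighborsRegressor

    X, y = make_regression(n_samples=200, n_features=3, noise=10.0, random_state=0)

    ks = list(range(1, 21))
    means, sems = [], []
    for k in ks:
        fold_mse = -cross_val_score(KNeighborsRegressor(n_neighbors=k), X, y,
                                    scoring="neg_mean_squared_error", cv=10)
        means.append(fold_mse.mean())                              # mean CV error for this k
        sems.append(fold_mse.std(ddof=1) / np.sqrt(len(fold_mse))) # its standard error
    means, sems = np.array(means), np.array(sems)

    best = means.argmin()
    threshold = means[best] + sems[best]
    # Largest k (smoothest, least complex model) whose CV error is within one SE of the best
    one_se_k = max(k for k, m in zip(ks, means) if m <= threshold)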

For this value of k, I would fit the model on the entire dataset.

The many so-called surrogate models built and tested during cross validation are usually treated as a good approximation of applying the same training function to the entire data set. This allows the generalization error obtained for the surrogate models to be used as an approximation of the generalization error of the "final" model.
This holds regardless of the use to which you later put this generalization error estimate.
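In code terms (a minimal sketch, again assuming scikit-learn and an illustrative fixed k): the pooled CV error of the surrogate models is what gets reported for the model that is then refit on the whole data set with the same training function.

    from sklearn.datasets import make_regression
    from sklearn.model_selection import cross_val_score
    from sklearn.neighbors import KNeighborsRegressor

    X, y = make_regression(n_samples=200, n_features=3, noise=10.0, random_state=0)

    model = KNeighborsRegressor(n_neighbors=5)   # same training function throughout
    cv_mse = -cross_val_score(model, X, y, scoring="neg_mean_squared_error", cv=10).mean()

    final_model = model.fit(X, y)                # "final" model trained on the entire data set
    print("generalization-error estimate (MSE) reported for final_model:", cv_mse)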