I see these concepts quite often and want to see if I have the right intuitive understanding.
Model fitting is when I have a set of data and fit a model (e.g. linear regression) as 'close' to the data as possible according to some loss function (e.g. squared loss). This can result in overfitting, since a higher-order polynomial model will always have a lower SSE than a lower-order model on the data it was fit to.
Cross validation tests the predictive ability of different models by splitting the data into training and testing sets, and this helps check for overfitting.
For instance, if I fit a second-order polynomial to linear data, I will get a lower SSE but probably not a lower prediction error. Therefore, between the two, I should choose the linear model. A different example: if I am fitting a k nearest neighbors model, then for each value of k (up to a reasonable number), fit the model as close to the training data as possible. Then, compare the prediction error on the testing data between the different values of k, and pick the one that has the lowest prediction error. For this value of k, fit the model on the entire dataset.
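The first example can be sketched with plain NumPy on a train/test split (the data, split sizes, and variable names here are all illustrative, not part of the original question):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic *linear* data: y = 2x + 1 + noise
x = rng.uniform(-3, 3, size=200)
y = 2 * x + 1 + rng.normal(0, 1, size=200)

# Simple train/test split
train, test = slice(0, 150), slice(150, 200)

def sse(coeffs, xs, ys):
    """Sum of squared errors of a polynomial fit."""
    return float(np.sum((np.polyval(coeffs, xs) - ys) ** 2))

fit1 = np.polyfit(x[train], y[train], deg=1)  # linear model
fit2 = np.polyfit(x[train], y[train], deg=2)  # quadratic model

train_sse1 = sse(fit1, x[train], y[train])
train_sse2 = sse(fit2, x[train], y[train])
test_sse1 = sse(fit1, x[test], y[test])
test_sse2 = sse(fit2, x[test], y[test])

# The higher-order fit can never have a larger *training* SSE,
# because the linear model is a special case of the quadratic one...
assert train_sse2 <= train_sse1 + 1e-9
# ...but on truly linear data its *test* error is typically no better,
# and that held-out comparison is what justifies picking the linear model.
```

The same loop over `deg` (or over `k` for kNN) followed by a final refit on the whole dataset is exactly the procedure described above.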
Do I have the right idea?
Best Answer
Yes, your understanding is correct.
Model selection or hyperparameter tuning is one purpose for which the CV estimate of predictive performance can be used. It is IMHO important not to confuse CV itself with the purpose to which its results are put.
In the first place, cross validation yields an approximation to the generalization error of a model, i.e. its expected predictive performance on unseen data. This estimate can then either be used as a performance estimate for the model at hand, or to compare several models or hyperparameter settings against each other.
As for overfitting within the model training procedure, CV helps but cannot work miracles. Keep in mind that cross validation results are themselves subject to variance (from various sources). Thus, with an increasing number of models/hyperparameter settings in the comparison, there is also an increased risk of accidentally observing very good prediction performance (due to the variance of the CV estimates) and being misled by it (see the one-standard-deviation rule for a heuristic against this).
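The one-standard-deviation rule mentioned above can be sketched in a few lines: instead of taking the setting with the lowest mean CV error, take the *simplest* setting whose mean error lies within one standard error of that minimum. The numbers below are made up purely for illustration:

```python
import numpy as np

# Illustrative CV results: mean error and standard error per candidate k
# (for kNN, larger k means a smoother, "simpler" model)
ks        = np.array([1, 3, 5, 7, 9, 11])
cv_mean   = np.array([1.30, 1.10, 1.02, 1.00, 1.04, 1.12])
cv_stderr = np.array([0.08, 0.07, 0.06, 0.06, 0.06, 0.07])

best = int(np.argmin(cv_mean))               # k=7 has the lowest mean CV error
threshold = cv_mean[best] + cv_stderr[best]  # one standard error above the best

# Among all settings within one standard error of the best,
# pick the simplest one (here: the largest k).
candidates = ks[cv_mean <= threshold]
chosen_k = int(candidates.max())             # -> 9, not the raw minimizer 7
```

The point is that differences smaller than the noise in the CV estimates should not drive the choice toward a more complex model.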
The many so-called surrogate models built and tested during cross validation are usually treated as good approximations to applying the same training function to the entire data set. This allows the generalization-error results obtained for the surrogate models to be used as an approximation to the generalization error of the "final" model.
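The surrogate-model idea can be made concrete with a from-scratch k-fold loop. This is only a sketch: the data, the 1-nearest-neighbour stand-in for the training function, and all names are illustrative assumptions, not anything from the answer itself.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # simple synthetic labels

def knn1_predict(X_tr, y_tr, X_te):
    """1-nearest-neighbour prediction: the stand-in 'training function'."""
    d = np.linalg.norm(X_te[:, None, :] - X_tr[None, :, :], axis=2)
    return y_tr[np.argmin(d, axis=1)]

k = 5
idx = rng.permutation(len(X))
folds = np.array_split(idx, k)

fold_errors = []
for i in range(k):
    te = folds[i]
    tr = np.concatenate([folds[j] for j in range(k) if j != i])
    # Each iteration builds one "surrogate model" on k-1 folds...
    pred = knn1_predict(X[tr], y[tr], X[te])
    # ...and its held-out error contributes to the CV estimate.
    fold_errors.append(float(np.mean(pred != y[te])))

# The average over surrogate models approximates the generalization
# error of the final model trained on the *entire* data set.
cv_error = float(np.mean(fold_errors))
```

Each surrogate model sees almost all of the data, which is why their pooled held-out error is taken as an estimate for the final model trained on everything.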
This applies regardless of the use to which you put the generalization error estimate later on.