I do not think this is conflicting advice. What we are really interested in is good out-of-sample performance, not in reducing the gap between training and test set performance. If the test set performance is representative of out-of-sample performance (i.e. the test set is large enough, uncontaminated, and a representative sample of the data our model will be applied to), then as long as we get good performance on the test set we are not overfitting, regardless of the gap.
Often, however, a large gap may indicate that we could get better test set performance with more regularization / by introducing more bias into the model. But that does not mean a smaller gap implies a better model; it is just that with a small or no gap between training and test set performance we know we are definitely not overfitting, so adding regularization / introducing more bias will not help.
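As a concrete illustration, here is a minimal sketch (assuming scikit-learn and a synthetic data set, both chosen purely for illustration) that contrasts a weakly and a strongly regularized model. The thing to look at is the test accuracy itself; the gap only hints that more regularization might help.

```python
# Compare train/test gap vs. test performance for two regularization strengths.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=50, n_informative=5,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for C in (100.0, 0.1):  # large C = weak regularization, small C = strong
    model = LogisticRegression(C=C, max_iter=5000).fit(X_train, y_train)
    train_acc = model.score(X_train, y_train)
    test_acc = model.score(X_test, y_test)
    # What matters is test_acc; the gap only suggests regularization *might* help.
    print(f"C={C:>6}: train={train_acc:.3f}  test={test_acc:.3f}  "
          f"gap={train_acc - test_acc:.3f}")
```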
Yes, your understanding is correct.
"Cross validation tests the predictive ability of different models by splitting the data into training and testing sets,"
Yes.
"and this helps check for overfitting."
Model selection or hyperparameter tuning is one purpose for which the CV estimate of predictive performance can be used. It is IMHO important not to confuse cross validation itself with the purpose its results are put to.
In the first place, cross validation yields an approximation of the generalization error (the expected predictive performance of a model on unseen data).
This estimate can be used either
- as an approximation of the generalization error of the model fitted on the whole data set with the same (hyper)parameter determination as was used for the CV surrogate models,
- or to select hyperparameters. If you do this, the CV estimate becomes part of model training, and you need an independent estimate of generalization error, see e.g. nested aka double cross validation (a minimal sketch follows after this list).
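The sketch below shows one common way to set up nested (double) cross validation, assuming scikit-learn, an SVM, and an illustrative C grid of my own choosing. The inner loop selects the hyperparameter; the outer loop provides the independent estimate of generalization error for the whole procedure.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

# Inner CV: the hyperparameter search is part of model training.
inner = GridSearchCV(make_pipeline(StandardScaler(), SVC()),
                     param_grid={"svc__C": [0.1, 1, 10]}, cv=5)

# Outer CV: independent estimate of generalization error for the whole
# procedure, including the hyperparameter selection.
outer_scores = cross_val_score(inner, X, y, cv=5)
print("nested CV accuracy: %.3f +/- %.3f"
      % (outer_scores.mean(), outer_scores.std()))
```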
As for overfitting within the model training procedure, CV helps but cannot work miracles. Keep in mind that cross validation results are also subject to variance (from various sources). Thus, with an increasing number of models/hyperparameter settings in the comparison, there is also an increased risk of accidentally (due to the variance of the CV estimates) observing very good prediction performance and being misled by this (see the one-standard-deviation rule for a heuristic against this; a sketch is given below).
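This heuristic is often called the one-standard-error rule: among all candidate settings whose mean CV score lies within one standard error of the best, prefer the most restrictive (simplest) one. A minimal sketch, assuming scikit-learn, ridge regression, and an illustrative alpha grid; for ridge I am treating "larger alpha" as "simpler":

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=200, n_features=30, noise=10.0, random_state=0)
alphas = [0.01, 0.1, 1.0, 10.0, 100.0]

means, ses = [], []
for alpha in alphas:
    scores = cross_val_score(Ridge(alpha=alpha), X, y, cv=10)  # R^2 per fold
    means.append(scores.mean())
    ses.append(scores.std(ddof=1) / np.sqrt(len(scores)))

best = int(np.argmax(means))
threshold = means[best] - ses[best]
# Among all settings within one SE of the best, prefer the most regularized one.
candidates = [i for i, m in enumerate(means) if m >= threshold]
chosen = max(candidates, key=lambda i: alphas[i])
print("best alpha by mean CV score:", alphas[best])
print("alpha chosen by one-SE rule:", alphas[chosen])
```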
"For this value of k, fit the model on the entire dataset."
The many so-called surrogate models built and tested during cross validation are usually treated as a good approximation of applying the same training function to the entire data set. This allows us to use the generalization error results obtained for the surrogate models as an approximation of the generalization error of the "final" model.
This applies regardless of what you later use this generalization error estimate for.
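In code, this pattern often looks like the following minimal sketch (scikit-learn and its built-in breast cancer data, both only for illustration): the CV scores come from the surrogate models, while the final model is fit on all of the data with the same training function.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
model = make_pipeline(StandardScaler(), LogisticRegression())

# Each of the 5 surrogate models is trained on ~80% of the data and tested on the rest.
cv_scores = cross_val_score(model, X, y, cv=5)
print("CV estimate of generalization accuracy: %.3f" % cv_scores.mean())

# The "final" model uses the same training function on the entire data set;
# the CV estimate above serves as its (slightly pessimistic) performance estimate.
final_model = model.fit(X, y)
```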
Best Answer
Sorry, my rep is too low to comment, so I will post this as an answer.
The benefit of conducting CV is that you can train your model on all of the data you have and yet still get a good estimate of the true error of your model.
The more variables you include in your model, the lower the training error will get. However, doing so results in overfitting, because your model becomes so specialized to its training data that it will perform worse when unseen data comes along. As Michael said, this is because the model, in order to minimize the training error, ends up fitting the noise present in the data. When you then use the model to predict unseen data, which will have a different noise signature, you end up with a larger prediction error.
CV simulates this situation by holding out part of the data for testing; the held-out data plays the role of the unseen data. CV does this K times and averages the error to obtain the validation error. Hence the validation error increases if the model is overfitted.
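A minimal sketch of that effect, assuming scikit-learn and a small synthetic regression problem (polynomial degree standing in for the number of variables): the high-degree model drives the training error down, but its averaged K-fold validation error is typically much larger than that of the simpler model.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(60, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=60)

for degree in (3, 12):  # degree 12 has many more "variables" and tends to overfit
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    train_mse = np.mean((model.fit(X, y).predict(X) - y) ** 2)
    # 5-fold CV: average held-out error over the K folds.
    cv_mse = -cross_val_score(model, X, y, cv=5,
                              scoring="neg_mean_squared_error").mean()
    print(f"degree {degree:>2}: train MSE={train_mse:.3f}  CV MSE={cv_mse:.3f}")
```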