Cross-validation – Can K-fold Cross Validation Cause Overfitting?

cross-validation, overfitting

I am learning $k$-fold cross validation. Since each fold will be used to train the model (in $k$ iterations), won't that cause overfitting?

Best Answer

K-fold cross validation is a standard technique to detect overfitting. It cannot "cause" overfitting in the sense of causality.

However, there is no guarantee that k-fold cross-validation removes overfitting. People treat it as a magic cure for overfitting, but it is not; it may not be enough on its own.

The proper way to apply cross-validation is as a method to detect overfitting. If you do CV and there is a large gap between the training error and the test error, then you know you are overfitting and need to get more diverse data, choose simpler models, or use stronger regularization. The converse does not hold: a small gap between training and test error does not mean you have avoided overfitting.
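A minimal sketch of using CV this way, assuming scikit-learn and a toy dataset (the model, data, and threshold are placeholders, not part of the original answer): compare the mean training score to the mean validation score across folds.

```python
# Sketch: use CV to DETECT overfitting by comparing train vs. validation scores.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate

X, y = make_classification(n_samples=500, random_state=0)   # toy data (assumption)
model = RandomForestClassifier(random_state=0)

scores = cross_validate(model, X, y, cv=5, return_train_score=True)
gap = scores["train_score"].mean() - scores["test_score"].mean()
print(f"train: {scores['train_score'].mean():.3f}, "
      f"validation: {scores['test_score'].mean():.3f}, gap: {gap:.3f}")
# A large gap suggests overfitting; a small gap does NOT prove its absence.
```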

It's not a magic cure, but the best method to detect overfitting we have (when used right).

Some examples of situations where cross-validation can fail (a splitter sketch addressing these follows the list):

  • data is ordered, and not shuffled prior to splitting
  • unbalanced data (try stratified cross-validation)
  • duplicates in different folds
  • natural groups (e.g., data from the same user) shuffled into multiple folds
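A minimal sketch, assuming scikit-learn, of splitters that address the pitfalls above: shuffling ordered data, stratifying unbalanced classes, and keeping natural groups in a single fold. The variable names in the usage comment are hypothetical.

```python
from sklearn.model_selection import KFold, StratifiedKFold, GroupKFold

# Ordered data: shuffle before splitting.
kf = KFold(n_splits=5, shuffle=True, random_state=0)

# Unbalanced classes: preserve the class ratio in every fold.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Natural groups (e.g., all records of one user) stay in the same fold.
gkf = GroupKFold(n_splits=5)

# Usage: for train_idx, test_idx in gkf.split(X, y, groups=user_ids): ...
```

Duplicates across folds usually have to be removed or merged before splitting; no splitter will fix them for you.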

There are other cases where cross-validation cannot detect information leakage and overfitting even when used perfectly correctly. For example, when analyzing time series, people like to standardize the data, split it into past and future parts, and then train a model to predict the future development of, say, stock prices. The subtle information leakage happened in the preprocessing: standardizing before the temporal split leaks the mean and variance of the "future" data into the "past" training data. Similar leaks can occur in other preprocessing steps. In outlier detection, if you scale all the data to $[0, 1]$, a model can learn that values close to 0 and 1 are the most extreme values that can ever be observed, and so on.
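A minimal sketch of this preprocessing-leakage pitfall, assuming scikit-learn and toy data (on this toy dataset the difference will be tiny; on real data with skewed scales or temporal drift it can matter a lot): fitting the scaler on all the data before splitting leaks statistics of the held-out part, while fitting it inside the CV loop via a pipeline does not.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=500, random_state=0)   # toy data (assumption)

# Leaky: the scaler has already seen the held-out folds.
X_scaled = StandardScaler().fit_transform(X)
leaky = cross_val_score(LogisticRegression(max_iter=1000), X_scaled, y, cv=5)

# Correct: the scaler is re-fit on the training portion of every fold.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
clean = cross_val_score(pipe, X, y, cv=5)

print(leaky.mean(), clean.mean())
```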

Back to your question:

Since each fold will be used to train the model (in $k$ iterations), won't that cause overfitting?

No. In each iteration a new model is trained from scratch, its accuracy is estimated on the held-out fold, and then that model is discarded. You do not deploy or reuse any of the models trained during CV.
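A minimal sketch of what k-fold CV actually does, assuming scikit-learn and toy data: each iteration trains a fresh model, scores it on the held-out fold, and then throws that model away; only the scores are kept, and the final model is trained separately on all the data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

X, y = make_classification(n_samples=500, random_state=0)   # toy data (assumption)
scores = []
for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    model = LogisticRegression(max_iter=1000)            # new model every fold
    model.fit(X[train_idx], y[train_idx])
    scores.append(model.score(X[test_idx], y[test_idx]))
    # `model` is never reused or deployed; only its score survives the loop.

print(sum(scores) / len(scores))
final_model = LogisticRegression(max_iter=1000).fit(X, y)   # trained on all data
```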

You use validation (such as CV) for two purposes:

  1. Estimate how well your model will (hopefully) work in practice when you deploy it, without yet risking a real A/B test in production. You only want to go live with models that are expected to work better than your current approach, otherwise it may cost your company millions.
  2. Find the "best" hyperparameters for training your final model (which you then train on the entire training data). Hyperparameter tuning is where you run a high risk of overfitting if you are not careful (see the nested-CV sketch after this list).
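A minimal sketch of the second use, assuming scikit-learn (the model, grid, and data are placeholders): tune hyperparameters with an inner CV and estimate the performance of the whole tuning procedure with an outer CV ("nested" CV), so the reported performance is not inflated by the tuning itself.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, random_state=0)   # toy data (assumption)
param_grid = {"C": [0.1, 1, 10]}                             # hypothetical grid

inner = GridSearchCV(SVC(), param_grid, cv=5)        # inner CV picks the "best" C
outer_scores = cross_val_score(inner, X, y, cv=5)    # outer CV gives an honest estimate
print(outer_scores.mean())

final_model = inner.fit(X, y).best_estimator_        # refit on all the data at the end
```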

CV is not a way of "training" a model by feeding it 10 batches of data.