Test Error – Why CV Estimate Underestimates Actual Test Error

bias, cross-validation

It is my understanding that the k-fold cross-validation estimate of test error usually underestimates the actual test error. I'm confused about why this is the case. I see why the training error is usually lower than the test error: you are training the model on the very same data that you are estimating the error on! But that isn't the case for cross-validation: the fold you measure error on is specifically left out during training.
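To make the setup concrete, here is a minimal sketch (assuming scikit-learn; the dataset and model are purely illustrative) of how a plain k-fold CV estimate is computed: the model is refit on k−1 folds each time and scored only on the held-out fold.

```python
# Minimal sketch of a k-fold CV estimate of test error (illustrative data and model).
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score

X, y = make_regression(n_samples=200, n_features=10, noise=5.0, random_state=0)

cv = KFold(n_splits=5, shuffle=True, random_state=0)
# For each split, the model is refit on 4 folds and scored on the remaining fold.
fold_scores = cross_val_score(Ridge(alpha=1.0), X, y,
                              cv=cv, scoring="neg_mean_squared_error")
print("CV estimate of test MSE:", -fold_scores.mean())
```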

Also, is it correct to say that the cross-validation estimate of test error is biased downward?

Best Answer

To give an example: reporting only the CV error of a model is problematic when you originally have multiple models (each with its own CV error and error variance) and then use this error to choose the model best suited for your application. This is problematic because with each model there is still a certain chance that you are lucky or unlucky (and obtain better or worse results than the model deserves), and by choosing a model you most likely also pick the one on which you were lucky. Reporting its CV error as the final error estimate therefore tends to be overly optimistic.
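As a rough illustration of this "lucky model" effect (the setup below is hypothetical and not taken from the answer): if the labels are pure noise, every candidate model has a true accuracy of 0.5, yet the best CV accuracy among several candidates tends to land above 0.5, so reporting that winning CV score as the test error would be optimistic.

```python
# Hypothetical simulation of selection optimism: labels independent of the features,
# so every candidate's true accuracy is 0.5, but the selected (best-scoring) candidate's
# CV accuracy tends to exceed 0.5.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.RandomState(0)
best_cv_scores = []
for rep in range(30):
    X = rng.normal(size=(100, 5))
    y = rng.randint(0, 2, size=100)   # labels unrelated to X: true accuracy is 0.5
    candidates = [KNeighborsClassifier(n_neighbors=k) for k in (1, 3, 5, 7, 9)]
    cv_acc = [cross_val_score(m, X, y, cv=5, scoring="accuracy").mean()
              for m in candidates]
    best_cv_scores.append(max(cv_acc))  # CV score of the selected ("lucky") model

print("mean CV accuracy of the selected model:", np.mean(best_cv_scores))
print("true accuracy of every candidate here:  0.5")
```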

If you want to dig deeper into the details, this answer links to some easy-to-read papers on the problem: Cross-validation misuse (reporting performance for the best hyperparameter value)

As @cbeleites points out, this is problematic when one a) uses the obtained k-fold CV error to choose the best model out of multiple candidates (e.g. models trained with different hyperparameters), which is part of the training process, and then b) reports that same error as the test error instead of using a separate, held-back test set. If you instead intended to ask about the pure CV error itself, without using it to choose any model, the answer by @cbeleites is more likely what you are looking for.
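One standard remedy, discussed in the linked question but not spelled out here, is to keep the hyperparameter selection inside an inner CV loop and estimate test error with an outer loop (nested CV) or on a held-back test set. A minimal sketch, assuming scikit-learn (the estimator and parameter grid are illustrative):

```python
# Sketch of nested CV: the inner loop picks a hyperparameter; the outer folds are
# only ever used to score the tuned model, never to select it.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=20, random_state=0)

# Inner loop: chooses C by 5-fold CV (this is part of training).
inner = GridSearchCV(SVC(kernel="rbf"), param_grid={"C": [0.1, 1, 10, 100]}, cv=5)

# Outer loop: each outer fold only scores the already-tuned model.
outer_scores = cross_val_score(inner, X, y, cv=5)
print("nested-CV accuracy estimate:", outer_scores.mean())

# For comparison, the inner CV score used for selection, which the answer above
# warns against reporting as the test error:
inner.fit(X, y)
print("best inner CV accuracy (selection score):", inner.best_score_)
```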
