Does k-fold cross validation overestimate test error?

cross-validation

Say I have 1000 points in my data and perform 5-fold cross validation.
Each fold trains the model on 800 instances, then computes the error on the 200-instance holdout fold.

The 800 training instances will not be distributed exactly like the original 1000: the mean of each predictor variable among the 800 will be slightly higher or lower than the mean over all 1000. Since the two subsets must average back to the fixed overall mean, the 200-instance holdout fold is necessarily biased in the opposite direction from the 800.
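
To make this concrete, here is a small simulation (my own sketch, assuming a single standard-normal predictor and scikit-learn's KFold, neither of which is in the original question) showing that within every fold the training mean and the holdout mean deviate from the overall mean in opposite directions:

```python
# Sketch of the question's setup: 1000 points, 5 folds. Because
# 0.8 * train_mean + 0.2 * holdout_mean equals the fixed overall mean,
# the two deviations always have opposite signs (ratio exactly -1/4).
import numpy as np
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
x = rng.normal(size=1000)      # one hypothetical predictor variable
overall_mean = x.mean()

for fold, (train_idx, hold_idx) in enumerate(
        KFold(n_splits=5, shuffle=True, random_state=0).split(x), start=1):
    train_dev = x[train_idx].mean() - overall_mean   # 800 training points
    hold_dev = x[hold_idx].mean() - overall_mean     # 200 holdout points
    print(f"fold {fold}: train {train_dev:+.4f}, holdout {hold_dev:+.4f}")
```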

Does it follow that the cross validation error is biased upward relative to the error on a genuinely unseen test sample?
Put simply, each fold is tested on data that systematically differs from its training data.

I am aware that k-fold cross validation can overestimate test error because each fold trains on only (k-1)/k of the instances, so the estimate is pessimistic whenever error increases as the training set shrinks. This point is made much better in Hastie, Tibshirani and Friedman, The Elements of Statistical Learning, Section 7.10.
I think my question is different, since it does not rely on error varying with training set size, but please correct or enlighten me.
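
For reference, here is a rough simulation of that known effect (my own sketch, using a hypothetical linear-regression setup from scikit-learn's make_regression, not anything from the original question): averaged over many repetitions, a model fit on 800 points has slightly higher fresh-data error than the same model fit on all 1000.

```python
# Monte Carlo sketch of the ESL 7.10 effect: models fit on 800 points
# (as inside each 5-fold CV iteration) are, on average, slightly worse
# on fresh data than the same model fit on all 1000 points.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

mse_800, mse_1000 = [], []
for seed in range(50):
    X, y = make_regression(n_samples=6000, n_features=20, noise=10.0,
                           random_state=seed)
    X_pool, y_pool = X[:1000], y[:1000]      # the 1000-point dataset
    X_fresh, y_fresh = X[1000:], y[1000:]    # large independent sample

    fit = lambda n: LinearRegression().fit(X_pool[:n], y_pool[:n])
    mse_800.append(mean_squared_error(y_fresh, fit(800).predict(X_fresh)))
    mse_1000.append(mean_squared_error(y_fresh, fit(1000).predict(X_fresh)))

print(f"mean MSE, trained on 800:  {np.mean(mse_800):.2f}")
print(f"mean MSE, trained on 1000: {np.mean(mse_1000):.2f}")  # lower on average
```
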
Thanks.

Best Answer

You're describing a scenario in which k-fold cross validation is run on the training set: each holdout fold plays the role of a validation set, not the test set.

Use the validation folds to choose the hyperparameters that minimize the validation error, then estimate generalization error with a single run on the untouched test set.
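
As a concrete sketch of that workflow (my own example, assuming hypothetical data and a ridge model, none of which come from the original question):

```python
# Hypothetical workflow: 5-fold CV on the training set selects a
# hyperparameter; the untouched test set gives the final error estimate.
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_regression(n_samples=1000, n_features=20, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=0)

# Each CV holdout fold acts as a validation set for choosing alpha.
search = GridSearchCV(Ridge(), {"alpha": [0.01, 0.1, 1.0, 10.0]}, cv=5)
search.fit(X_train, y_train)

print("chosen alpha:", search.best_params_["alpha"])
# One final evaluation on data the model has never touched.
print("test R^2:", search.score(X_test, y_test))
```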