Solved – Cross-validation misuse (reporting performance for the best hyperparameter value)

cross-validation, model-selection, model-evaluation, references

Recently I came across a paper that proposes using a k-NN classifier on a specific dataset. The authors used all of the available samples to perform k-fold cross-validation for different values of k (the number of neighbors) and report the cross-validation results of the best hyperparameter configuration.

To my knowledge, this estimate is biased; they should have retained a separate test set to obtain an accuracy estimate on samples that were not used for hyperparameter optimization.

Am I right? Can you provide some references (preferably research papers) that describe this misuse of cross-validation?

Best Answer

Yes, there are issues with reporting only the k-fold CV results of the selected hyperparameter. You could use, e.g., the following three publications (though there are more out there, of course) to point people in the right direction:

I personally like these because they state the issues in plain English rather than in math.
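
To make the bias concrete, here is a minimal sketch contrasting the two protocols. The dataset, hyperparameter grid, and fold counts are illustrative assumptions, not anything from the paper in question; the point is that the "best hyperparameter" CV score reuses the same splits for tuning and evaluation, while nested CV evaluates the whole selection procedure on data the search never saw.

```python
# Sketch: optimistically biased "best CV score" vs. nested cross-validation.
# Dataset, grid, and fold counts are assumptions chosen for illustration.
from sklearn.datasets import load_breast_cancer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score

X, y = load_breast_cancer(return_X_y=True)
param_grid = {"n_neighbors": [1, 3, 5, 7, 9, 11]}

inner_cv = KFold(n_splits=5, shuffle=True, random_state=0)
outer_cv = KFold(n_splits=5, shuffle=True, random_state=1)

# Biased protocol: tune on all the data and report the CV score of the
# winning hyperparameter value (the score used to pick it).
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=inner_cv)
search.fit(X, y)
print("Best-hyperparameter CV score (optimistically biased):", search.best_score_)

# Nested CV: each outer fold evaluates a model whose hyperparameter was
# chosen on the remaining data only, giving a less biased estimate of the
# full model-selection procedure.
nested_scores = cross_val_score(
    GridSearchCV(KNeighborsClassifier(), param_grid, cv=inner_cv),
    X, y, cv=outer_cv,
)
print("Nested CV score:", nested_scores.mean())
```

On most datasets the first number will be somewhat higher than the second, which is exactly the optimism the references above discuss.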