Optimal number of folds in $K$-fold cross-validation: is leave-one-out CV always the best choice?

bias-variance tradeoff, cross-validation

Computing power considerations aside, are there any reasons to believe that increasing the number of folds in cross-validation leads to better model selection/validation (i.e. that the higher the number of folds the better)?

Taking the argument to the extreme, does leave-one-out cross-validation necessarily lead to better models than $K$-fold cross-validation?

Some background on this question: I am working on a problem with very few instances (e.g. 10 positives and 10 negatives), and I am afraid that my models may not generalize well or will overfit with so little data.

Best Answer

Leave-one-out cross-validation does not generally lead to better performance than k-fold, and is more likely to be worse, as it has a relatively high variance (i.e. its value changes more for different samples of data than the value for k-fold cross-validation). This is bad in a model selection criterion as it means the model selection criterion can be optimised in ways that merely exploit the random variation in the particular sample of data, rather than making genuine improvements in performance, i.e. you are more likely to over-fit the model selection criterion. The reason leave-one-out cross-validation is used in practice is that for many models it can be evaluated very cheaply as a by-product of fitting the model.
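As a rough illustration of the variance point (my own sketch, not part of the original answer), the snippet below repeatedly draws small datasets from the same distribution, estimates accuracy with leave-one-out and with 10-fold CV, and compares how much each estimate fluctuates across draws. The dataset size, classifier and number of repetitions are arbitrary choices, and on a toy problem like this the gap between the two can be modest or noisy.

```python
# Sketch: compare the spread of LOO vs 10-fold CV estimates across
# repeated draws of a small dataset (illustrative assumptions throughout).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, KFold, cross_val_score

loo_scores, kfold_scores = [], []
for seed in range(50):
    # a fresh small sample from the same underlying distribution each time
    X, y = make_classification(n_samples=60, n_features=10, random_state=seed)
    clf = LogisticRegression(max_iter=1000)
    loo_scores.append(cross_val_score(clf, X, y, cv=LeaveOneOut()).mean())
    kfold_scores.append(
        cross_val_score(clf, X, y,
                        cv=KFold(n_splits=10, shuffle=True,
                                 random_state=seed)).mean())

# the standard deviation across draws is the (estimated) variance issue
print("LOO    mean=%.3f sd=%.3f" % (np.mean(loo_scores), np.std(loo_scores)))
print("10-CV  mean=%.3f sd=%.3f" % (np.mean(kfold_scores), np.std(kfold_scores)))
```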

If computational expense is not primarily an issue, a better approach is to perform repeated k-fold cross-validation, where the k-fold cross-validation procedure is repeated with different random partitions into k disjoint subsets each time. This reduces the variance.
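A minimal sketch of repeated k-fold CV, assuming scikit-learn (again my example, not the answerer's code): `RepeatedStratifiedKFold` re-partitions the data with a different random split on each repeat, and the numbers of splits and repeats below are illustrative choices.

```python
# Sketch: repeated (stratified) k-fold cross-validation with scikit-learn.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

# toy stand-in for a small dataset (e.g. ~20 patterns)
X, y = make_classification(n_samples=20, n_features=5, random_state=0)

cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=20, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)

# averaging over 5 x 20 = 100 folds smooths out the split-to-split noise
# that a single 5-fold partition would leave in the estimate
print("mean accuracy: %.3f (+/- %.3f)" % (scores.mean(), scores.std()))
```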

If you have only 20 patterns, it is very likely that you will experience over-fitting of the model selection criterion, which is a much neglected pitfall in statistics and machine learning (shameless plug: see my paper on the topic). You may be better off choosing a relatively simple model and trying not to optimise it very aggressively, or adopting a Bayesian approach and averaging over all model choices, weighted by their plausibility. IMHO, optimisation is the root of all evil in statistics, so it is better not to optimise if you don't have to, and to optimise with caution whenever you do.

Note also that if you are going to perform model selection, you need to use something like nested cross-validation if you also need a performance estimate (i.e. you need to treat model selection as an integral part of the model-fitting procedure and cross-validate that as well).
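Here is a hedged sketch of what that nesting looks like with scikit-learn (illustrative, not from the answer): the inner loop (`GridSearchCV`) does the model selection, and the outer loop scores the whole selection-plus-fitting procedure. The SVM, the parameter grid and the fold counts are arbitrary choices.

```python
# Sketch: nested cross-validation, where model selection is cross-validated too.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=20, n_features=5, random_state=0)

inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=1)
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=2)

# model selection (tuning C and gamma) is wrapped inside the estimator...
search = GridSearchCV(SVC(),
                      {"C": [0.1, 1, 10], "gamma": ["scale", 0.1]},
                      cv=inner_cv)

# ...so the outer CV scores the *entire* procedure, selection included
nested_scores = cross_val_score(search, X, y, cv=outer_cv)
print("nested CV accuracy: %.3f" % nested_scores.mean())
```

Reporting the score of `search.fit(X, y)` on the same data instead would re-use the data that drove the model selection, which is exactly the optimistic bias the answer warns about.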