Leave-one-out cross-validation does not generally lead to better performance than K-fold, and is more likely to be worse, as it has a relatively high variance (i.e. its value changes more for different samples of data than the value for k-fold cross-validation). This is bad in a model selection criterion as it means the model selection criterion can be optimised in ways that merely exploit the random variation in the particular sample of data, rather than making genuine improvements in performance, i.e. you are more likely to over-fit the model selection criterion. The reason leave-one-out cross-validation is used in practice is that for many models it can be evaluated very cheaply as a by-product of fitting the model.
If computational expense is not primarily an issue, a better approach is to perform repeated k-fold cross-validation, where the k-fold cross-validation procedure is repeated with different random partitions into k disjoint subsets each time. This reduces the variance.
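To make the repeated k-fold idea concrete, here is a minimal sketch in plain NumPy (my illustration, not the original answer's code): the k-fold split is repeated with a fresh random partition each time, and averaging the per-repeat estimates reduces the variance of the final estimate.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
X = rng.uniform(-3, 3, size=(n, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=n)

def kfold_mse(X, y, k, rng):
    """Average test MSE of a degree-3 polynomial fit over one random k-fold split."""
    idx = rng.permutation(len(y))
    mses = []
    for f in np.array_split(idx, k):
        train = np.setdiff1d(idx, f)
        coef = np.polyfit(X[train, 0], y[train], deg=3)
        pred = np.polyval(coef, X[f, 0])
        mses.append(np.mean((y[f] - pred) ** 2))
    return np.mean(mses)

# One repeat gives one noisy estimate; averaging many repeats stabilises it.
repeats = [kfold_mse(X, y, k=5, rng=rng) for _ in range(20)]
estimate = np.mean(repeats)
print(estimate)
```

The same effect can be obtained with scikit-learn's `RepeatedKFold` if you prefer a library implementation.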
If you have only 20 patterns, it is very likely that you will experience over-fitting of the model selection criterion, which is a much neglected pitfall in statistics and machine learning (shameless plug: see my paper on the topic). You may be better off choosing a relatively simple model and trying not to optimise it very aggressively, or adopting a Bayesian approach and averaging over all model choices, weighted by their plausibility. IMHO optimisation is the root of all evil in statistics, so it is better not to optimise if you don't have to, and to optimise with caution whenever you do.
Note also if you are going to perform model selection, you need to use something like nested cross-validation if you also need a performance estimate (i.e. you need to consider model selection as an integral part of the model fitting procedure and cross-validate that as well).
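A nested cross-validation sketch, again in plain NumPy and under my own toy assumptions (polynomial degree as the model choice): the inner k-fold selects the degree using only the outer training data, and the outer k-fold scores the whole selection procedure, so the performance estimate is not biased by the model selection itself.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 120
x = rng.uniform(-3, 3, n)
y = np.sin(x) + rng.normal(scale=0.3, size=n)

def cv_mse(x, y, degree, k, rng):
    """Average test MSE of a polynomial of the given degree over one k-fold split."""
    idx = rng.permutation(len(y))
    mses = []
    for f in np.array_split(idx, k):
        tr = np.setdiff1d(idx, f)
        coef = np.polyfit(x[tr], y[tr], deg=degree)
        mses.append(np.mean((y[f] - np.polyval(coef, x[f])) ** 2))
    return np.mean(mses)

outer_idx = rng.permutation(n)
outer_scores = []
for test in np.array_split(outer_idx, 5):        # outer loop: performance estimate
    tr = np.setdiff1d(outer_idx, test)
    # inner loop: select the degree using only the outer-training data
    best = min(range(1, 6), key=lambda d: cv_mse(x[tr], y[tr], d, k=5, rng=rng))
    coef = np.polyfit(x[tr], y[tr], deg=best)
    outer_scores.append(np.mean((y[test] - np.polyval(coef, x[test])) ** 2))

print(np.mean(outer_scores))  # estimate of the model-selection-plus-fit pipeline
```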
Why would models learned with leave-one-out CV have higher variance?
[TL;DR] A summary of recent posts and debates (July 2018)
This topic has been widely discussed both on this site, and in the scientific literature, with conflicting views, intuitions and conclusions. Back in 2013 when this question was first asked, the dominant view was that LOOCV leads to larger variance of the expected generalization error of a training algorithm producing models out of samples of size $n(K-1)/K$.
This view, however, appears to be an incorrect generalization of a special case and I would argue that the correct answer is: "it depends..."
Paraphrasing Yves Grandvalet, the author of a 2004 paper on the topic, I would summarize the intuitive argument as follows:
- If cross-validation were averaging independent estimates: then leave-one-out CV should see relatively lower variance between models, since we are only shifting one data point across folds and therefore the training sets between folds overlap substantially.
- This is not true when training sets are highly correlated: correlation may increase with $K$, and this increase is responsible for the overall increase of variance in the second scenario. Intuitively, in that situation, leave-one-out CV may be blind to instabilities that exist but are not triggered by changing a single point in the training data, which makes the estimate highly sensitive to the realization of the training set.
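The overlap claim in the first bullet is easy to check by counting (my illustration): two different k-fold training sets share all points except the two held-out folds, so their overlap fraction is $(K-2)/(K-1)$, which approaches 1 as $K$ grows toward $n$ (LOOCV). The fold estimates therefore become strongly correlated rather than independent.

```python
def training_set_overlap(k):
    """Fraction of points shared by two different k-fold training sets.

    Each training set has n(k-1)/k points; two of them share n(k-2)/k points,
    so the shared fraction is (k-2)/(k-1), independent of n.
    """
    return (k - 2) / (k - 1)

for k in [2, 5, 10, 100]:
    print(k, training_set_overlap(k))
# k = 2 gives 0.0 (disjoint training sets); k = 100 gives 98/99, nearly 1
```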
Experimental simulations from myself and others on this site, as well as those of the researchers in the papers linked below, show that there is no universal truth on the topic. Most experiments have monotonically decreasing or constant variance with $K$, but some special cases show increasing variance with $K$.
The rest of this answer proposes a simulation on a toy example and an informal literature review.
[Update] You can find here an alternative simulation for an unstable model in the presence of outliers.
Simulations from a toy example showing decreasing / constant variance
Consider the following toy example where we are fitting a degree 4 polynomial to a noisy sine curve. We expect this model to fare poorly for small datasets due to overfitting, as shown by the learning curve.
Note that we plot 1 - MSE here to reproduce the illustration from ESLII page 243
Methodology
You can find the code for this simulation here. The approach was the following:
- Generate 10,000 points from the distribution $\sin(x) + \epsilon$ where the true variance of $\epsilon$ is known
- Iterate $i$ times (e.g. 100 or 200 times). At each iteration, change the dataset by resampling $N$ points from the original distribution
- For each data set $i$:
- Perform K-fold cross validation for one value of $K$
- Store the average Mean Square Error (MSE) across the K-folds
- Once the loop over $i$ is complete, calculate the mean and standard deviation of the MSE across the $i$ datasets for the same value of $K$
- Repeat the above steps for all $K$ in range $\{ 5,...,N\}$ all the way to Leave One Out CV (LOOCV)
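The steps above can be condensed into a short NumPy sketch (my re-implementation under the stated assumptions; the answer's own code is behind the link above): a degree-4 polynomial on noisy $\sin(x)$, with the mean and standard deviation of the k-fold MSE computed across resampled datasets for each $K$.

```python
import numpy as np

rng = np.random.default_rng(42)

def kfold_mse(x, y, k, rng):
    """Average test MSE of a degree-4 polynomial over one random k-fold split."""
    idx = rng.permutation(len(y))
    mses = []
    for f in np.array_split(idx, k):
        tr = np.setdiff1d(idx, f)
        coef = np.polyfit(x[tr], y[tr], deg=4)
        mses.append(np.mean((y[f] - np.polyval(coef, x[f])) ** 2))
    return np.mean(mses)

N, n_datasets = 40, 100
results = {}
for k in [5, 10, 20, N]:             # K = N is leave-one-out
    per_dataset = []
    for _ in range(n_datasets):      # resample a fresh dataset each iteration
        x = rng.uniform(-3, 3, N)
        y = np.sin(x) + rng.normal(scale=0.3, size=N)
        per_dataset.append(kfold_mse(x, y, k, rng))
    # mean and std of the CV estimate across datasets, for this K
    results[k] = (np.mean(per_dataset), np.std(per_dataset))

for k, (mean, sd) in results.items():
    print(k, round(mean, 3), round(sd, 3))
```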
Impact of $K$ on the Bias and Variance of the MSE across $i$ datasets.
Left-hand side: k-folds for 200 data points; right-hand side: k-folds for 40 data points
Standard Deviation of MSE (across data sets i) vs Kfolds
From this simulation, it seems that:
- For a small number of data points ($N = 40$), increasing $K$ up to $K = 10$ or so significantly improves both the bias and the variance. For larger $K$ there is no effect on either.
- The intuition is that when the effective training size is too small, the polynomial model is very unstable, especially for $K \leq 5$
- For larger $N = 200$, increasing $K$ has no particular impact on either the bias or the variance.
An informal literature review
The following three papers investigate the bias and variance of cross-validation:
Kohavi 1995
This paper is often referred to as the source of the argument that LOOCV has higher variance. In section 1:
“For example, leave-one-out is almost unbiased, but it has high variance, leading to unreliable estimates (Efron 1983)”
This statement is the source of much confusion, because the claim seems to come from Efron in 1983, not Kohavi. Both Kohavi's theoretical arguments and his experimental results go against this statement:
Corollary 2 (Variance in CV)
Given a dataset and an inducer. If the inducer is stable under the perturbations caused by deleting the test instances for the folds in k-fold CV for various values of $k$, then the variance of the estimate will be the same
Experiment
In his experiment, Kohavi compares two algorithms, a C4.5 decision tree and a Naive Bayes classifier, across multiple datasets from the UC Irvine repository. His results are below: the LHS plot shows accuracy vs. folds (i.e. bias) and the RHS plot shows standard deviation vs. folds.
In fact, only the decision tree on three datasets clearly shows higher variance for increasing $K$. The other results show decreasing or constant variance.
Finally, although the conclusion could be worded more strongly, there is no argument there for LOO having higher variance; quite the opposite. From section 6, Summary:
"k-fold cross validation with moderate k values (10-20) reduces the variance... As k decreases (2-5) and the samples get smaller, there is variance due to instability of the training sets themselves."
Zhang and Yang
The authors take a strong view on this topic and clearly state in Section 7.1
In fact, in least squares linear regression, Burman (1989) shows that among the k-fold CVs, in estimating the prediction error, LOO (i.e., n-fold CV) has the smallest asymptotic bias and variance. ...
... Then a theoretical calculation (Lu, 2007) shows that LOO has the smallest bias and variance at the same time among all delete-$n_v$ CVs with all possible $n_v$ deletions considered
Experimental results
Similarly, Zhang's experiments point in the direction of decreasing variance with $K$, as shown below for the true model and the wrong model in Figures 3 and 5.
The only experiment for which variance increases with $K$ is for the Lasso and SCAD models. This is explained as follows on page 31:
However, if model selection is involved, the performance of LOO worsens in variability as the model selection uncertainty gets higher due to large model space, small penalty coefficients and/or the use of data-driven penalty coefficients
Best Answer
I don't know the answer and can't comment, but maybe I can give my simple take: imagine you are using your model to help others by offering a service based on it (like "start planting seeds when the temperature rises above 0 for the next 5 months").
My guess is that training/test splits of the data matter to the customer only a little: the customer most likely doesn't know whether the model was trained on 10 years or 9 years of data; if the model is trained well on 1 year of data, that's good (and computing over 10 years can take close to forever, or at least too long, I can only guess). The whole point of fitting a model on lots of data is to grasp its "complexity", i.e. the tops and bottoms of the data: when things start appearing again and again, we notice "aha, this is what this data is about / how it behaves".
It's about the stationarity of the data source: does the weather act on the same principles as 100 years ago? I guess yes, so feeding the model 100 years of data lets it grasp them; but maybe 10 years of data are enough. I mean, only something that repeats can be forecast. Many things are changing, but many changes are repeatable; if I know it is changing, the change itself repeats :-)
If you can show in the fit (on the training set) that the data is fitted well, say over 200 years of hourly data points, and the model fits very well (no overfitting, of course), there is a very, very good chance it stays the same for the next 5 months (data-source stationarity).
If there is something like "hm, I didn't include the data from nearby; a river was created, so it's colder", that's a problem; if "all" the data that have a big influence on the y data are involved in the model, then, for me, it's a good model.
These cross-validation test/training splits were, I guess, made by model-building people as a bridge toward the people who use the models; but as consumers of a model we always do out-of-sample prediction, so the training set is often all the data currently available. Cross-validation is very good, IMHO; it is cheap in the sense that the model is fitted once and then just makes predictions.
So among the metrics here are things like "rolling": making forecasts and then comparing them to the real values as they arrive. This simulates, as in reality, fitting the model and making predictions iteratively: we sit in a building near the field making a forecast, for example, every day, and when the model predicts "yes, you can start planting", we do.
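A rolling-origin ("simulate the customer") evaluation can be sketched like this (my illustration of the idea above, on a synthetic seasonal series): refit on everything observed up to time $t$, forecast the next point, then compare to the value that actually arrives.

```python
import numpy as np

rng = np.random.default_rng(7)
t = np.arange(200)
series = np.sin(2 * np.pi * t / 50) + rng.normal(scale=0.2, size=len(t))

errors = []
for now in range(150, 199):
    # fit a simple harmonic model on everything observed so far
    past = t[: now + 1]
    A = np.column_stack([np.sin(2 * np.pi * past / 50),
                         np.cos(2 * np.pi * past / 50),
                         np.ones_like(past, dtype=float)])
    coef, *_ = np.linalg.lstsq(A, series[: now + 1], rcond=None)
    # one-step-ahead forecast, then score it against the realised value
    nxt = now + 1
    pred = (coef[0] * np.sin(2 * np.pi * nxt / 50)
            + coef[1] * np.cos(2 * np.pi * nxt / 50) + coef[2])
    errors.append((series[nxt] - pred) ** 2)

print(np.sqrt(np.mean(errors)))  # rolling one-step-ahead RMSE
```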
Cross-validation is a nice, simple, good thing. Rolling simulation of the customer's usage is a nice thing too.
I don't have any article or book to point to; I just wanted to share a simplified view.
I'm not an expert, but if the model parameters fitted on a 10% training set are very different from those fitted on 80% of the data, I would give myself a little pause: a model is just a structure of expected dependencies. If the parameters from 10% differ from those from 80%, it looks like "regime switching", which should either be included in the model (like the river mentioned above), or handled by another model that grasps changes the original model did not expect to come or didn't pay attention to (or that weren't spotted in the early stages). The good case is when the model grasps the structure well, so that a model fitted on 10% of the data provides the "same performance" as one fitted on 80%; if the parameters remain similar (or the same), I would call that robustness of the parameters, or robustness of the model: it works stably over the whole data. "It models the data well :-)"
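The parameter-stability check described above can be sketched in a few lines (my illustration, on synthetic data): fit the same model on 10% and on 80% of the data and compare the coefficients.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 1000
x = rng.uniform(-3, 3, n)
y = 2.0 * x + 1.0 + rng.normal(scale=0.5, size=n)  # stable linear structure

idx = rng.permutation(n)
small, large = idx[: n // 10], idx[: int(0.8 * n)]

coef_small = np.polyfit(x[small], y[small], deg=1)
coef_large = np.polyfit(x[large], y[large], deg=1)

# Similar coefficients across very different sample sizes suggest the model
# has captured a stable structure ("robustness of parameters"); a large gap
# would hint at regime switching or a missing variable.
print(coef_small, coef_large, np.abs(coef_small - coef_large).max())
```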