why would models learned with leave-one-out CV have higher variance?
[TL:DR] A summary of recent posts and debates (July 2018)
This topic has been widely discussed both on this site, and in the scientific literature, with conflicting views, intuitions and conclusions. Back in 2013 when this question was first asked, the dominant view was that LOOCV leads to larger variance of the expected generalization error of a training algorithm producing models out of samples of size $n(K−1)/K$.
This view, however, appears to be an incorrect generalization of a special case and I would argue that the correct answer is: "it depends..."
Paraphrasing Yves Grandvalet the author of a 2004 paper on the topic I would summarize the intuitive argument as follows:
- If cross-validation were averaging independent estimates: then leave-one-out CV one should see relatively lower variance between models since we are only shifting one data point across folds and therefore the training sets between folds overlap substantially.
- This is not true when training sets are highly correlated: Correlation may increase with K and this increase is responsible for the overall increase of variance in the second scenario. Intuitively, in that situation, leave-one-out CV may be blind to instabilities that exist, but may not be triggered by changing a single point in the training data, which makes it highly variable to the realization of the training set.
Experimental simulations from myself and others on this site, as well as those of researchers in the papers linked below will show you that there is no universal truth on the topic. Most experiments have monotonically decreasing or constant variance with $K$, but some special cases show increasing variance with $K$.
The rest of this answer proposes a simulation on a toy example and an informal literature review.
[Update] You can find here an alternative simulation for an unstable model in the presence of outliers.
Simulations from a toy example showing decreasing / constant variance
Consider the following toy example where we are fitting a degree 4 polynomial to a noisy sine curve. We expect this model to fare poorly for small datasets due to overfitting, as shown by the learning curve.
Note that we plot 1 - MSE here to reproduce the illustration from ESLII page 243
Methodology
You can find the code for this simulation here. The approach was the following:
- Generate 10,000 points from the distribution $sin(x) + \epsilon$ where the true variance of $\epsilon$ is known
- Iterate $i$ times (e.g. 100 or 200 times). At each iteration, change the dataset by resampling $N$ points from the original distribution
- For each data set $i$:
- Perform K-fold cross validation for one value of $K$
- Store the average Mean Square Error (MSE) across the K-folds
- Once the loop over $i$ is complete, calculate the mean and standard deviation of the MSE across the $i$ datasets for the same value of $K$
- Repeat the above steps for all $K$ in range $\{ 5,...,N\}$ all the way to Leave One Out CV (LOOCV)
Impact of $K$ on the Bias and Variance of the MSE across $i$ datasets.
Left Hand Side: Kfolds for 200 data points, Right Hand Side: Kfolds for 40 data points
Standard Deviation of MSE (across data sets i) vs Kfolds
From this simulation, it seems that:
- For small number $N = 40$ of datapoints, increasing $K$ until $K=10$ or so significantly improves both the bias and the variance. For larger $K$ there is no effect on either bias or variance.
- The intuition is that for too small effective training size, the polynomial model is very unstable, especially for $K \leq 5$
- For larger $N = 200$ - increasing $K$ has no particular impact on both the bias and variance.
An informal literature review
The following three papers investigate the bias and variance of cross validation
Kohavi 1995
This paper is often refered to as the source for the argument that LOOC has higher variance. In section 1:
“For example, leave-oneout is almost unbiased, but it has high variance, leading to unreliable estimates (Efron 1983)"
This statement is source of much confusion, because it seems to be from Efron in 1983, not Kohavi. Both Kohavi's theoretical argumentations and experimental results go against this statement:
Corollary 2 ( Variance in CV)
Given a dataset and an inducer. If the inducer is stable under the perturbations caused by deleting the test instances for the folds in k-fold CV for various values of $k$, then the variance of the estimate will be the same
Experiment
In his experiment, Kohavi compares two algorithms: a C4.5 decision tree and a Naive Bayes classifier across multiple datasets from the UC Irvine repository. His results are below: LHS is accuracy vs folds (i.e. bias) and RHS is standard deviation vs folds
In fact, only the decision tree on three data sets clearly has higher variance for increasing K. Other results show decreasing or constant variance.
Finally, although the conclusion could be worded more strongly, there is no argument for LOO having higher variance, quite the opposite. From section 6. Summary
"k-fold cross validation with moderate k values (10-20) reduces the variance... As k-decreases (2-5) and the samples get smaller, there is variance due to instability of the training sets themselves.
Zhang and Yang
The authors take a strong view on this topic and clearly state in Section 7.1
In fact, in least squares linear regression, Burman (1989) shows that among the k-fold CVs, in estimating the prediction error, LOO (i.e., n-fold CV) has the smallest asymptotic bias and variance. ...
... Then a theoretical calculation (Lu, 2007) shows that LOO has the smallest bias and variance at the same time among all delete-n CVs with all possible n_v deletions considered
Experimental results
Similarly, Zhang's experiments point in the direction of decreasing variance with K, as shown below for the True model and the wrong model for Figure 3 and Figure 5.
The only experiment for which variance increases with $K$ is for the Lasso and SCAD models. This is explained as follows on page 31:
However, if model selection is involved, the performance of LOO worsens in variability as the model selection uncertainty gets higher due to large model space, small penalty coefficients and/or the use of data-driven penalty coefficients
Setting NaNs to the mean of the training or test set is ok?
You need to define a procedure that you always follow. I see two valid options here:
- either use some value (e.g. mean) calculated for that subject (see also below)
- or some value calculated for the training set, basically a hyperparameter "value to be used for replacing NAs". This should not be calculated from the whole test set (independent testing also means that no parameters calculated from other test subjects should be used: the processing should not depend on the composition of the test set).
Edit:
Which method to follow (replace by value computed within subject or by value computed within training set) should IMHO be decided from the knowledge about the application and the data, we cannot tell you more than very general guidelines here.
Why not by value computed within test set?: That would mean that the value used to replace NA
s in test subject A depends on whether subject A is tested together with subject B or subject C – which doesn't seem to be a desirable or sensible behaviour to me.
You may also want to look up "Imputation" which is the general term for techniques that fill in missing values.
Centering and scaling (standardization): if you have "external" (scientific) knowledge that suggests that a standardization within the subjects should take place, then go ahead with that. Whether this is sensible depends on your application and data, so we cannot answer this question.
For a more general discussion of centering and standardization, see e.g. Variables are often adjusted (e.g. standardised) before making a model - when is this a good idea, and when is it a bad one? and When conducting multiple regression, when should you center your predictor variables & when should you standardize them?
Now within each outer fold I plan to tune a classifier's parameter with help of another cross-validation.
With 4 subjects you probably won't be able to compare classifiers anyways: the uncertainty due to having only 4 test cases is far too high. Can't you fix this parameter by experience with similar data?
To illustrate the problem: assume you observe 4 correct classifications out of 4 test subjects. That gives you a point estimate of 100% correct predictions. If you look at confidence interval calculations for this, you get e.g. for the Agresti-Coull method a 95% ci of 45 - 105% (obviously not very precise with the small sample size), Bayes method with uniform prior makes it 55 - 100%. In any case it means that even if you observe perfect test results, it is not quite clear whether you can claim that the model is actually better than guessing. As long as you do not need to fear that fixing the parameter beforehand will produce a model that is clearly worse than guessing, you anyways cannot measure improvements in the practically important range.
The situation may be less drastic if you optimize e.g. Brier score but with 4 subjects I'd suspect that you still do not reach the precision you need for the expected improvement during the optimization.
Edit: Unfortunately, while 20 subjects are far more than 4, from a classifier validation statistics point of view, 20 is still very few.
We recently concluded that if you need to stick with the frequently used proportions for characterizing your classifier, at least in our field you cannot expect to have a useful precision in the test results with less than 75 - 100 test subjects (in the denominator of the proportion!). Again, you may be better off if you can switch to e.g. Brier's score, and with a paired design
for classifier comparison, but I'd call it lucky if that gains you a factor of 5 in sample size.
You can find our thoughts here: Beleites, C. and Neugebauer, U. and Bocklitz, T. and Krafft, C. and Popp, J.: Sample size planning for classification models. Anal Chim Acta, 2013, 760, 25-33.
DOI: 10.1016/j.aca.2012.11.007
accepted manuscript on arXiv: 1211.1323
AFAIK, dealing with the random uncertainty on test results during classifier optimization is an unsolved problem. (If not, I'd be extremely interested to papers about the solution!)
So my recommendation would be to do a preliminary experiment/analysis at the end of which you try to estimate the random uncertainty on the comparison results. If these do not allow to optimize (which I'd unfortunately expect to be the outcome), report this result and argue that in consequence you do not have any choice at the moment but fixing the hyper-parameters to some sensible (though not optimized) value.
Does the inner cross-validation necessarily need to be leave-one-subject-out as well?
If you do inner cross validation it would be better to do it subject-wise as well: without this, you'll get overly optimistic inner results. Which would not a problem iff the bias were constant. However, it usually isn't and you have the additional problem that due to the random uncertainty together with the optimistic bias you may observe many models that seem to be perfect. Among these you cannot distinguish (after all, they all seem to be perfect) which can completely mess up the optimization.
Again, with so few subjects I'd avoid this inner optimization and fix the parameter.
Best Answer
Just to add slightly to the answer of @SubravetiSuraj (+1)
Cross-validation gives a pessimistically biased estimate of performance because most statistical models will improve if the training set is made larger. This means that k-fold cross-validation estimates the performance of a model trained on a dataset $100\times\frac{(k-1)}{k}\%$ of the available data, rather than on 100% of it. So if you perform cross-validation to estimate performance, and then use a model trained on all of the data for operational use, it will perform slightly better than the cross-validation estimate suggests.
Leave-one-out cross-validation is approximately unbiased, because the difference in size between the training set used in each fold and the entire dataset is only a single pattern. There is a paper on this by Luntz and Brailovsky (in Russian).
Luntz, Aleksandr, and Viktor Brailovsky. "On estimation of characters obtained in statistical procedure of recognition." Technicheskaya Kibernetica 3.6 (1969): 6–12.
see also
Lachenbruch,Peter A., and Mickey, M. Ray. "Estimation of Error Rates in Discriminant Analysis." Technometrics 10.1 (1968): 1–11.
However, while leave-one-out cross-validation is approximately unbiased, it tends to have a high variance (so you would get very different estimates if you repeated the estimate with different initial samples of data from the same distribution). As the error of the estimator is a combination of bias and variance, whether leave-one-out cross-validation is better than 10-fold cross-validation depends on both quantities.
Now the variance in fitting the model tends to be higher if it is fitted to a small dataset (as it is more sensitive to any noise/sampling artifacts in the particular training sample used). This means that 10-fold cross-validation is likely to have a high variance (as well as a higher bias) if you only have a limited amount of data, as the size of the training set will be smaller than for LOOCV. So k-fold cross-validation can have variance issues as well, but for a different reason. This is why LOOCV is often better when the size of the dataset is small.
However, the main reason for using LOOCV in my opinion is that it is computationally inexpensive for some models (such as linear regression, most kernel methods, nearest-neighbour classifiers, etc.), and unless the dataset were very small, I would use 10-fold cross-validation if it fitted in my computational budget, or better still, bootstrap estimation and bagging.