Variance of $K$-fold cross-validation estimates as $f(K)$: what is the role of “stability”?

Tags: cross-validation, machine-learning, predictive-models, regression, variance

TL;DR: It appears that, contrary to oft-repeated advice, leave-one-out cross-validation (LOO-CV) — that is, $K$-fold CV with $K$ (the number of folds) equal to $N$ (the number of training observations) — yields estimates of the generalization error that are the least variable, not the most variable, of any choice of $K$, assuming a certain stability condition on either the model/algorithm, the dataset, or both (I'm not sure which is correct, as I don't really understand this stability condition).

  • Can someone clearly explain what exactly this stability condition is?
  • Is it true that linear regression is one such "stable" algorithm, implying that in that context, LOO-CV is strictly the best choice of CV as far as bias and variance of the estimates of generalization error are concerned?

The conventional wisdom is that the choice of $K$ in $K$-fold CV follows a bias-variance tradeoff, such that lower values of $K$ (approaching 2) lead to estimates of the generalization error that have a more pessimistic bias but lower variance, while higher values of $K$ (approaching $N$) lead to estimates that are less biased but have greater variance. The conventional explanation for this phenomenon of variance increasing with $K$ is given perhaps most prominently in The Elements of Statistical Learning (Section 7.10.1):

With K=N, the cross-validation estimator is approximately unbiased for the true (expected) prediction error, but can have high variance because the N "training sets" are so similar to one another.

The implication is that the $N$ validation errors are more highly correlated, so that their sum is more variable. This line of reasoning has been repeated in many answers on this site (e.g., here, here, here, here, here, here, and here) as well as on various blogs, etc. But a detailed analysis is virtually never given; instead we get only an intuition or a brief sketch of what an analysis might look like.

One can, however, find contradictory statements, usually citing a certain "stability" condition that I don't really understand. For example, this contradictory answer quotes a couple of paragraphs from a 2015 paper which says, among other things, "For models/modeling procedures with low instability, LOO often has the smallest variability" (emphasis added). This paper (Section 5.2) seems to agree that LOO represents the least variable choice of $K$ as long as the model/algorithm is "stable." Taking yet another stance on the issue, there is also this paper (Corollary 2), which says "The variance of $k$-fold cross validation […] does not depend on $k$," again citing a certain "stability" condition.

The explanation of why LOO might be the most variable $K$-fold CV is intuitive enough, but there is a counter-intuition. The final CV estimate of the mean squared error (MSE) is the mean of the per-fold MSE estimates. So as $K$ increases toward $N$, the CV estimate is the mean of an increasing number of random variables, and, other things being equal, the variance of a mean decreases with the number of variables being averaged over. So in order for LOO to be the most variable $K$-fold CV, the increase in variance due to the increased correlation among the per-fold MSE estimates would have to outweigh the decrease in variance due to the greater number of folds being averaged over. And it is not at all obvious that this is true.
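To make this counter-intuition concrete, consider the idealized case (my simplification, not taken from the references above) where the $K$ per-fold MSE estimates $e_1, ..., e_K$ have a common variance $\sigma^2$ and a common pairwise correlation $\rho$. Then

$$\operatorname{Var}\left(\frac{1}{K}\sum_{k=1}^{K} e_k\right) = \frac{\sigma^2}{K} + \frac{K-1}{K}\,\rho\,\sigma^2,$$

so whether the variance of the CV estimate rises or falls with $K$ depends on how quickly $\rho$ (and $\sigma^2$ itself, since smaller validation folds give noisier per-fold estimates) grows as $K$ increases, which is exactly the point in dispute.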

Having become thoroughly confused thinking about all this, I decided to run a little simulation for the linear regression case. I simulated 10,000 datasets with $N$=50 and 3 uncorrelated predictors, each time estimating the generalization error using $K$-fold CV with $K$=2, 5, 10, or 50=$N$. The R code is here. Here are the resulting means and variances of the CV estimates across all 10,000 datasets (in MSE units):

         k = 2 k = 5 k = 10 k = n = 50
mean     1.187 1.108  1.094      1.087
variance 0.094 0.058  0.053      0.051

These results show the expected pattern that higher values of $K$ lead to a less pessimistic bias, but also appear to confirm that the variance of the CV estimates is lowest, not highest, in the LOO case.
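For reference, here is a minimal R sketch of this kind of simulation (my own reconstruction, not the code linked above; it assumes standard normal predictors and noise, unit coefficients, and uses fewer repetitions than the 10,000 above to keep the runtime down):

    # Across-dataset mean and variance of K-fold CV estimates of the MSE,
    # for OLS with n = 50 and 3 uncorrelated predictors.
    set.seed(1)
    n <- 50; n_sims <- 2000; Ks <- c(2, 5, 10, 50)

    cv_mse <- function(dat, K) {
      folds <- sample(rep(1:K, length.out = nrow(dat)))   # balanced random folds
      mean(sapply(1:K, function(k) {
        fit <- lm(y ~ ., data = dat[folds != k, ])
        mean((dat$y[folds == k] - predict(fit, dat[folds == k, ]))^2)
      }))
    }

    res <- replicate(n_sims, {
      X <- matrix(rnorm(n * 3), n, 3)
      dat <- data.frame(X, y = drop(X %*% c(1, 1, 1)) + rnorm(n))
      sapply(Ks, cv_mse, dat = dat)
    })

    rownames(res) <- paste0("k = ", Ks)
    rbind(mean = rowMeans(res), variance = apply(res, 1, var))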

So it appears that linear regression is one of the "stable" cases mentioned in the papers above, where increasing $K$ is associated with decreasing rather than increasing variance in the CV estimates. But what I still don't understand is:

  • What precisely is this "stability" condition? Does it apply to models/algorithms, datasets, or both to some extent?
  • Is there an intuitive way to think about this stability?
  • What are other examples of stable and unstable models/algorithms or datasets?
  • Is it relatively safe to assume that most models/algorithms or datasets are "stable" and therefore that $K$ should generally be chosen as high as is computationally feasible?

Best Answer

This answer follows up on my answer in Bias and variance in leave-one-out vs K-fold cross validation, which discusses why LOOCV does not always lead to higher variance. Following a similar approach, I will attempt to highlight a case where LOOCV does lead to higher variance, in the presence of outliers and an "unstable" model.

Algorithmic stability (learning theory)

The topic of algorithmic stability is a recent one, and several classic, influential results have been proven in the past 20 years; the Bousquet and Elisseeff paper referenced below is among the most frequently cited.

The best page to gain an understanding is certainly the Wikipedia page, which provides an excellent summary written by a presumably very knowledgeable user.

Intuitive definition of stability

Intuitively, a stable algorithm is one for which the prediction does not change much when the training data is modified slightly.

Formally, there are half a dozen versions of stability, linked together by technical conditions and hierarchies; see, for example, this graphic from here:

[Figure: diagram of the hierarchy of stability notions]

The objective, however, is simple: we want tight bounds on the generalization error of a specific learning algorithm when the algorithm satisfies the stability criterion. As one would expect, the more restrictive the stability criterion, the tighter the corresponding bound will be.

Notation

The following notation is from the Wikipedia article, which itself copies the Bousquet and Elisseeff paper:

  • The training set $S = \{ z_1 = (x_1,y_1), ..., z_m = (x_m, y_m)\}$ is drawn i.i.d. from an unknown distribution $D$
  • The loss function $V$ of a hypothesis $f$ with respect to an example $z$ is defined as $V(f,z)$
  • We modify the training set by removing the $i$-th element: $S^{|i} = \{ z_1,...,z_{i-1}, z_{i+1},...,z_m\}$
  • Or by replacing the $i$-th element: $S^{i} = \{ z_1,...,z_{i-1}, z_i^{'}, z_{i+1},...,z_m\}$

Formal definitions

Perhaps the strongest notion of stability that an interesting learning algorithm might be expected to obey is that of uniform stability:

Uniform stability An algorithm has uniform stability $\beta$ with respect to the loss function $V$ if the following holds:

$$\forall S \in Z^m, \ \ \forall i \in \{ 1,...,m\}, \ \ \sup_{z \in Z} \ | V(f_S,z) - V(f_{S^{|i}},z) | \ \leq \beta$$

Considered as a function of $m$, the term $\beta$ can be written as $\beta_m$. We say the algorithm is stable when $\beta_m$ decreases as $\frac{1}{m}$. A slightly weaker form of stability is:

Hypothesis stability

$$\forall i \in \{ 1,...,m\}, \ \ \mathbb{E}_{S,z}\left[ \ | V(f_S,z) - V(f_{S^{|i}},z) | \ \right] \ \leq \beta$$

If one point is removed, the difference in the outcome of the learning algorithm is measured by the averaged absolute difference of the losses ($L_1$ norm). Intuitively: small changes in the sample can only cause the algorithm to move to nearby hypotheses.
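As a rough way to get a feel for this quantity (my own illustration, not taken from the papers), one can approximate the hypothesis-stability expectation by Monte Carlo for a given algorithm: repeatedly draw a training set $S$ of size $m$ and a fresh example $z$, fit the model with and without the $i$-th training point, and average the absolute difference in losses. A sketch for ordinary least squares with squared-error loss and standard normal data:

    # Monte Carlo approximation of E[ |V(f_S, z) - V(f_{S^{|i}}, z)| ]
    # for OLS with squared-error loss V(f, z) = (y - f(x))^2.
    set.seed(1)
    hypothesis_stability <- function(m, n_rep = 2000) {
      mean(replicate(n_rep, {
        X <- matrix(rnorm(m * 3), m, 3)
        y <- drop(X %*% c(1, 1, 1)) + rnorm(m)
        z_x <- rnorm(3); z_y <- sum(z_x) + rnorm(1)     # fresh test example z
        i <- sample(m, 1)                               # index of the removed point
        f_S <- lm(y ~ X)                                # trained on all of S
        y_i <- y[-i]; X_i <- X[-i, , drop = FALSE]
        f_Si <- lm(y_i ~ X_i)                           # trained on S^{|i}
        loss <- function(fit) (z_y - sum(coef(fit) * c(1, z_x)))^2
        abs(loss(f_S) - loss(f_Si))
      }))
    }

    sapply(c(25, 50, 100, 200), hypothesis_stability)   # shrinks roughly like 1/m

(Because the data are i.i.d., the expectation does not depend on which index $i$ is removed, so drawing $i$ at random is harmless.)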

The advantage of these forms of stability is that they provide bounds for the bias and variance of stable algorithms. In particular, Bousquet and Elisseeff proved such bounds for uniform and hypothesis stability in 2002. Since then, much work has been done to try to relax the stability conditions and generalize the bounds; for example, in 2011, Kale, Kumar, and Vassilvitskii argued that mean-square stability provides better quantitative variance reduction bounds.
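For orientation (this is my paraphrase from memory of the 2002 results, so treat the exact constants as indicative rather than authoritative): the uniform-stability bound says that, with probability at least $1-\delta$ over the draw of $S$,

$$R(f_S) \ \leq\ R_{\mathrm{emp}}(f_S) + 2\beta + \left(4m\beta + M\right)\sqrt{\frac{\ln(1/\delta)}{2m}},$$

where $R$ is the true risk, $R_{\mathrm{emp}}$ the empirical risk on $S$, and $M$ a bound on the loss. The bound is only informative when $\beta = \beta_m$ decays like $1/m$ or faster, which is the case, for example, for regularized least squares with regularization parameter $\lambda$, where $\beta_m = O\!\left(1/(\lambda m)\right)$.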

Some examples of stable algorithms

The following algorithms have been shown to be stable and have proven generalization bounds:

  • Regularized least squares regression (with an appropriate prior)
  • KNN classifier with 0-1 loss function
  • SVM with a bounded kernel and large regularization constant
  • Soft margin SVM
  • Minimum relative entropy algorithm for classification
  • A version of bagging regularizers

An experimental simulation

Repeating the experiment from the previous thread (see here), we now introduce a certain proportion of outliers into the data set. In particular:

  • 97% of the data has uniform noise on $[-0.5, 0.5]$
  • 3% of the data has uniform noise on $[-20, 20]$

As the order-3 polynomial model is not regularized, it will be heavily influenced by the presence of a few outliers for small data sets. For larger data sets, or when there are more outliers, their effect is smaller as they tend to cancel out. See below for two fitted models, for 60 and 200 data points.

[Figure: order-3 polynomial fits for 60 and 200 data points with outliers]
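For concreteness, here is a minimal sketch of this kind of data-generating process and an unregularized cubic fit (the noise levels and the 3% outlier rate are as described above; the true signal and the range of $x$ are my own assumptions, not taken from the linked thread):

    # Data with a 3% fraction of heavy outliers, plus an unregularized cubic fit.
    set.seed(1)
    make_data <- function(n) {
      x <- runif(n, -1, 1)
      is_outlier <- runif(n) < 0.03                      # ~3% of points
      noise <- ifelse(is_outlier, runif(n, -20, 20), runif(n, -0.5, 0.5))
      data.frame(x = x, y = x^3 + noise)                 # assumed cubic signal
    }

    fit60  <- lm(y ~ poly(x, 3), data = make_data(60))   # order-3 polynomial, no regularization
    fit200 <- lm(y ~ poly(x, 3), data = make_data(200))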

Performing the simulation as previously and plotting the resulting average MSE and variance of the MSE gives results very similar to Experiment 2 of the Bengio & Grandvalet 2004 paper.

Left Hand Side: no outliers. Right Hand Side: 3% outliers.

[Figure: average MSE of the CV estimates]

[Figure: variance of the MSE of the CV estimates]

(see the linked paper for an explanation of the last figure)

Explanations

Quoting Yves Grandvalet's answer on the other thread:

Intuitively, [in the situation of unstable algorithms], leave-one-out CV may be blind to instabilities that exist, but may not be triggered by changing a single point in the training data, which makes it highly variable to the realization of the training set.

In practice it is quite difficult to simulate an increase in variance due to LOOCV. It requires a particular combination of instability, some outliers (but not too many), and a large number of iterations. Perhaps this is expected, since linear regression has been shown to be quite stable. An interesting experiment would be to repeat this for higher-dimensional data and a more unstable algorithm (e.g., a decision tree).
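For anyone who wants to try that, a hypothetical starting point (reusing the make_data() sketch above and assuming the rpart package; none of this is from the original answer) could be:

    # Across-dataset variance of 10-fold CV vs LOOCV estimates
    # for an unstable learner (a regression tree).
    library(rpart)
    set.seed(1)

    cv_mse_tree <- function(dat, K) {
      folds <- sample(rep(1:K, length.out = nrow(dat)))
      mean(sapply(1:K, function(k) {
        fit <- rpart(y ~ ., data = dat[folds != k, ])
        mean((dat$y[folds == k] - predict(fit, dat[folds == k, ]))^2)
      }))
    }

    est <- replicate(500, {
      dat <- make_data(60)
      c(k10 = cv_mse_tree(dat, 10), loo = cv_mse_tree(dat, 60))
    })
    apply(est, 1, var)    # does LOOCV come out more variable here?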