Solved – Does k-fold cross validation always imply k uniformly sized subsets

cross-validation

I'm a bit confused on a minor point about a cross-validation strategy I've come across in my work: it creates k folds, but the folds are not of equal size (for example, some folds have 17 items, others 18, on up to 24). Is k-fold cross-validation constrained to folds of equal length? Arbitrary choices of data-set size and fold count can of course yield fractional fold sizes, where one fold draws the short stick, but would it be accurate to say k-fold cross-validation attempts to make the fold sizes roughly equal?

In particular, I'm hearing contradictory messages in this question:

Matt Krause "divided into different, mutually-exclusive 'folds'"

a Data Head "k-fold cross-validation (kFCV) divides the N data points into k mutually exclusive subsets of equal size."

Best Answer

I didn't mean to imply that the folds should be different sizes in that answer (and I'll update it to match).

Each fold should contain an equal number of observations, or as close to equal as possible. If you want to perform 10-fold cross-validation on $N=101$ observations, one fold will have 11 items rather than 10. That's fine. If you have 102 observations, it'd be best to have two folds of 11 items each (and eight of 10), rather than one fold of 12 and nine of 10, though I doubt this matters much in practice, particularly as $N/k$ increases.

There is no need to choose $k$ so that it evenly divides $N$ (how would you even run cross-validation on a prime-number-sized data set?), or to discard examples until $N$ is evenly divisible by $k$ (data is hard to get; don't throw it away).
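To make the "as close to equal as possible" rule concrete, here is a minimal sketch of how the remainder gets spread across folds: with `divmod(N, k)`, the first `N % k` folds each receive one extra observation (this matches the convention scikit-learn's `KFold` documents, though any assignment of the leftovers works equally well).

```python
def fold_sizes(n, k):
    """Split n observations into k near-equal folds.

    The first (n % k) folds get one extra item, so fold sizes
    differ by at most one and always sum to n.
    """
    base, extra = divmod(n, k)
    return [base + 1 if i < extra else base for i in range(k)]

# N = 101, k = 10: one fold of 11, nine folds of 10
print(fold_sizes(101, 10))
# N = 102, k = 10: two folds of 11, eight folds of 10
print(fold_sizes(102, 10))
```

Note that a prime-sized data set poses no problem: `fold_sizes(101, 10)` simply hands the leftover observation to the first fold.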
