Solved – K-fold validation, how to use MSE and STD for model selection

Tags: bias, cross-validation, model selection, variance

When using K-fold validation for model selection I'm wondering what's the best approach to select a model using both the mean square error (MSE) and the standard deviation of errors among folds (STD).

Below I present the approach I follow. I would like to know whether it is a good approach, whether the assumptions are correct, and what better approaches could be followed.

Basic Ideas

In general I assume these ideas:

The average MSE across folds gives a good estimate of the generalization error of the model.
A low average MSE across folds indicates that the model bias is low.
The standard deviation of the MSE among folds can be used as a good proxy for the model's variance.

Proposed Approach

If a model gives the lowest MSE and the lowest STD, then pick it (it is the model with the lowest bias and variance).
If no model has the lowest values for both MSE and STD, then pick the model that wins the most pairwise comparisons on bias and variance against the other models. This is obtained in the following way: given models A, B, C with average MSE and STD values AMSE(A), ASTD(A), etc., A is the best model if it has the highest value of

sum(AMSE(A)<AMSE(B),AMSE(A)<AMSE(C))+sum(ASTD(A)<ASTD(B),ASTD(A)<ASTD(C))

where true counts as 1 and false counts as 0.
If it is known from domain knowledge that the data will push models towards high variance or towards high bias, then pick the model with the lowest STD or the lowest MSE, respectively.
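For concreteness, the counting rule in the second bullet could be sketched as below. The model names and the (mean MSE, STD) values are made up purely for illustration, and `win_count` is a hypothetical helper name, not something from the question.

```python
# Made-up cross-validation summaries: model -> (mean MSE, STD of MSE across folds).
cv_summary = {
    "A": (0.52, 0.08),
    "B": (0.55, 0.05),
    "C": (0.61, 0.07),
}

def win_count(name, summary):
    """Count the pairwise comparisons `name` wins on mean MSE plus those it wins on STD."""
    mse, std = summary[name]
    others = [value for key, value in summary.items() if key != name]
    mse_wins = sum(mse < other_mse for other_mse, _ in others)  # True counts as 1
    std_wins = sum(std < other_std for _, other_std in others)
    return mse_wins + std_wins

scores = {name: win_count(name, cv_summary) for name in cv_summary}
best = max(scores, key=scores.get)
print(scores, best)
```

Note that the rule only counts wins: a model that is better by a negligible margin scores the same as one that is better by a large margin.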

What are the best approaches to follow? Are these assumptions OK?

Best Answer

Nope; you can reason about bias and variance by comparing predictions with the true values. CV folds have nothing to do with that.

What you can do with the variance of the MSE over CV folds is use it to test whether the difference between the averages is significant, and thus whether it is even justified to say that the models are not equivalent.
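As a rough sketch of such a test, one option is a paired t-test on the per-fold MSEs of two models evaluated on the same folds. The per-fold values below are made up for illustration, and `paired_t_statistic` is a hypothetical helper name. Because the folds share training data, the per-fold differences are not truly independent, so the result should be read as approximate.

```python
import math
import statistics

# Made-up per-fold MSEs for two models evaluated on the SAME 10 folds.
mse_a = [0.50, 0.47, 0.55, 0.52, 0.49, 0.58, 0.51, 0.46, 0.53, 0.54]
mse_b = [0.53, 0.49, 0.54, 0.56, 0.50, 0.60, 0.55, 0.47, 0.52, 0.57]

def paired_t_statistic(a, b):
    """t statistic for the mean of the paired differences a[i] - b[i]."""
    diffs = [x - y for x, y in zip(a, b)]
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    return mean_diff / (sd_diff / math.sqrt(len(diffs)))

t_stat = paired_t_statistic(mse_a, mse_b)

# Two-sided 5% critical value of Student's t with 10 - 1 = 9 degrees of freedom.
T_CRIT = 2.262
significant = abs(t_stat) > T_CRIT
print(f"t = {t_stat:.3f}, significant at 5%: {significant}")
```

If the statistic does not exceed the critical value, the fold-to-fold spread is too large to claim that one model is better than the other.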

[TL;DR] A summary of recent posts and debates (July 2018)

This topic has been widely discussed both on this site and in the scientific literature, with conflicting views, intuitions and conclusions. Back in 2013 when this question was first asked, the dominant view was that LOOCV leads to larger variance of the expected generalization error of a training algorithm producing models out of samples of size $n(K-1)/K$.

This view, however, appears to be an incorrect generalization of a special case and I would argue that the correct answer is: "it depends..."

Paraphrasing Yves Grandvalet, the author of a 2004 paper on the topic, I would summarize the intuitive argument as follows:

If cross-validation were averaging independent estimates, then with leave-one-out CV one should see relatively lower variance between models, since we are only shifting one data point across folds and therefore the training sets between folds overlap substantially.
This is not true when training sets are highly correlated: Correlation may increase with K and this increase is responsible for the overall increase of variance in the second scenario. Intuitively, in that situation, leave-one-out CV may be blind to instabilities that exist, but may not be triggered by changing a single point in the training data, which makes it highly variable to the realization of the training set.
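The role of correlation in this argument can be illustrated with a standard identity. As a simplification made here for illustration (it is not the decomposition used in the paper), assume the $K$ fold estimates $e_1, \ldots, e_K$ all have the same variance $\sigma^2$ and the same pairwise correlation $\rho$. Then:

$$\operatorname{Var}\left(\frac{1}{K}\sum_{k=1}^{K} e_k\right) = \frac{\sigma^2}{K} + \frac{K-1}{K}\,\rho\,\sigma^2$$

With $\rho = 0$ the variance of the average shrinks as $1/K$. As $\rho \to 1$ it approaches $\sigma^2$ no matter how many folds are averaged. Since both $\sigma^2$ and $\rho$ change with $K$, the net effect of increasing $K$ is not determined in advance.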

Experimental simulations by myself and others on this site, as well as those of the researchers in the papers discussed below, show that there is no universal truth on the topic. Most experiments show monotonically decreasing or constant variance with $K$, but some special cases show increasing variance with $K$.

The rest of this answer proposes a simulation on a toy example and an informal literature review.

[Update] You can find here an alternative simulation for an unstable model in the presence of outliers.

Simulations from a toy example showing decreasing / constant variance

Consider the following toy example where we are fitting a degree 4 polynomial to a noisy sine curve. We expect this model to fare poorly for small datasets due to overfitting, as shown by the learning curve.

Note that we plot 1 - MSE here to reproduce the illustration from ESLII page 243

Methodology

You can find the code for this simulation here. The approach was the following:

Generate 10,000 points from the distribution $\sin(x) + \epsilon$ where the true variance of $\epsilon$ is known
Iterate $i$ times (e.g. 100 or 200 times). At each iteration, change the dataset by resampling $N$ points from the original distribution
For each data set $i$:
- Perform K-fold cross validation for one value of $K$
- Store the average Mean Square Error (MSE) across the K-folds
Once the loop over $i$ is complete, calculate the mean and standard deviation of the MSE across the $i$ datasets for the same value of $K$
Repeat the above steps for all $K$ in the range $\{5, \ldots, N\}$, all the way up to Leave One Out CV (LOOCV)
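The steps above can be sketched roughly as follows. This is not the original simulation code: the noise level, the range of $x$, the number of datasets and the values of $K$ are assumptions chosen to keep the example small. It also samples each dataset directly from the distribution rather than from a pre-generated pool of 10,000 points.

```python
import numpy as np

rng = np.random.default_rng(0)
NOISE_STD = 0.5   # assumed noise level; the original value is not stated here
DEGREE = 4        # degree-4 polynomial, as in the toy example

def sample_dataset(n):
    """Draw n points from sin(x) + eps, with x uniform on [0, 2*pi] (assumed range)."""
    x = rng.uniform(0.0, 2.0 * np.pi, size=n)
    y = np.sin(x) + rng.normal(0.0, NOISE_STD, size=n)
    return x, y

def kfold_mse(x, y, k):
    """Average test MSE of the polynomial fit over k shuffled folds."""
    folds = np.array_split(rng.permutation(len(x)), k)
    fold_mses = []
    for test_idx in folds:
        train_mask = np.ones(len(x), dtype=bool)
        train_mask[test_idx] = False
        coefs = np.polyfit(x[train_mask], y[train_mask], DEGREE)
        pred = np.polyval(coefs, x[test_idx])
        fold_mses.append(np.mean((y[test_idx] - pred) ** 2))
    return float(np.mean(fold_mses))

def simulate(n, ks, n_datasets=50):
    """For each K, return (mean, std) of the CV estimate across resampled datasets."""
    results = {}
    for k in ks:
        estimates = [kfold_mse(*sample_dataset(n), k) for _ in range(n_datasets)]
        results[k] = (float(np.mean(estimates)), float(np.std(estimates, ddof=1)))
    return results

results = simulate(n=40, ks=[5, 10, 20, 40])  # K = 40 is LOOCV when N = 40
for k, (mean_mse, std_mse) in results.items():
    print(f"K={k:2d}  mean MSE={mean_mse:.3f}  std across datasets={std_mse:.3f}")
```

The quantity of interest is the second number for each $K$: the spread of the cross-validation estimate across independent datasets, not the spread across folds within a single dataset.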

Impact of $K$ on the Bias and Variance of the MSE across $i$ datasets.

Left Hand Side: Kfolds for 200 data points, Right Hand Side: Kfolds for 40 data points

Standard Deviation of MSE (across data sets i) vs Kfolds

From this simulation, it seems that:

For a small number of data points ($N = 40$), increasing $K$ up to $K = 10$ or so significantly improves both the bias and the variance. For larger $K$ there is no effect on either bias or variance.
The intuition is that when the effective training size is too small, the polynomial model is very unstable, especially for $K \leq 5$.
For larger $N = 200$, increasing $K$ has no particular impact on either the bias or the variance.

An informal literature review

The following three papers investigate the bias and variance of cross validation

Kohavi 1995

This paper is often referred to as the source for the argument that LOOCV has higher variance. In Section 1:

"For example, leave-one-out is almost unbiased, but it has high variance, leading to unreliable estimates (Efron 1983)"

This statement is the source of much confusion, because it seems to come from Efron in 1983, not from Kohavi. Both Kohavi's theoretical arguments and his experimental results go against this statement:

Corollary 2 (Variance in CV)

Given a dataset and an inducer. If the inducer is stable under the perturbations caused by deleting the test instances for the folds in k-fold CV for various values of $k$, then the variance of the estimate will be the same.

Experiment

Kohavi compares two algorithms, a C4.5 decision tree and a Naive Bayes classifier, across multiple datasets from the UC Irvine repository. His results are below: the left-hand side shows accuracy vs. folds (i.e. bias) and the right-hand side shows standard deviation vs. folds.

In fact, only the decision tree on three data sets clearly has higher variance for increasing K. Other results show decreasing or constant variance.

Finally, although the conclusion could be worded more strongly, there is no argument for LOO having higher variance, quite the opposite. From Section 6 (Summary):

"k-fold cross validation with moderate k values (10-20) reduces the variance... As k decreases (2-5) and the samples get smaller, there is variance due to instability of the training sets themselves."

Zhang and Yang

The authors take a strong view on this topic and clearly state in Section 7.1:

In fact, in least squares linear regression, Burman (1989) shows that among the k-fold CVs, in estimating the prediction error, LOO (i.e., n-fold CV) has the smallest asymptotic bias and variance. ...

... Then a theoretical calculation (Lu, 2007) shows that LOO has the smallest bias and variance at the same time among all delete-n CVs with all possible n_v deletions considered

Experimental results

Similarly, Zhang and Yang's experiments point in the direction of decreasing variance with $K$, as shown in their Figure 3 and Figure 5 for the true model and the wrong model.

The only experiment for which variance increases with $K$ is for the Lasso and SCAD models. This is explained as follows on page 31:

However, if model selection is involved, the performance of LOO worsens in variability as the model selection uncertainty gets higher due to large model space, small penalty coefficients and/or the use of data-driven penalty coefficients

Model Selection and Cross-Validation – The Right Strategies

My paper in JMLR addresses this exact question, and demonstrates why the procedure suggested in the question (or at least one very like it) results in optimistically biased performance estimates:

Gavin C. Cawley, Nicola L. C. Talbot, "On Over-fitting in Model Selection and Subsequent Selection Bias in Performance Evaluation", Journal of Machine Learning Research, 11(Jul):2079−2107, 2010. (www)

The key thing to remember is that cross-validation is a technique for estimating the generalisation performance for a method of generating a model, rather than of the model itself. So if choosing kernel parameters is part of the process of generating the model, you need to cross-validate the model selection process as well, otherwise you will end up with an optimistically biased performance estimate (as will happen with the procedure you propose).

Assume you have a function fit_model, which takes in a dataset consisting of attributes X and desired responses Y, and which returns the fitted model for that dataset, including the tuning of hyper-parameters (in this case kernel and regularisation parameters). This tuning of hyper-parameters can be performed in many ways, for example minimising the cross-validation error over X and Y.

Step 1 - Fit the model to all available data, using the function fit_model. This gives you the model that you will use in operation or deployment.

Step 2 - Performance evaluation. Perform repeated cross-validation using all available data. In each fold, the data are partitioned into a training set and a test set. Fit the model using the training set (record hyper-parameter values for the fitted model) and evaluate performance on the test set. Use the mean over all of the test sets as a performance estimate (and perhaps look at the spread of values as well).

Step 3 - Variability of hyper-parameter settings. Perform an analysis of the hyper-parameter values collected in Step 2. However, I should point out that there is nothing special about hyper-parameters: they are just parameters of the model that have been estimated (indirectly) from the data. They are treated as hyper-parameters rather than parameters for computational/mathematical convenience, but this doesn't have to be the case.

The problem with using cross-validation here is that the training and test data are not independent samples (as they share data) which means that the estimate of the variance of the performance estimate and of the hyper-parameters is likely to be biased (i.e. smaller than it would be for genuinely independent samples of data in each fold). Rather than repeated cross-validation, I would probably use bootstrapping instead and bag the resulting models if this was computationally feasible.

The key point is that to get an unbiased performance estimate, whatever procedure you use to generate the final model (fit_model) must be repeated in its entirety independently in each fold of the cross-validation procedure.
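A minimal sketch of Steps 1 and 2 is given below, under several assumptions that are not in the answer above. Ridge regression on synthetic data stands in for the kernel model, the penalty grid `LAMBDAS` is hypothetical, and a single pass of 5-fold cross-validation is used instead of repeated cross-validation. The point it illustrates is that `fit_model`, including its internal tuning, is rerun from scratch inside every outer fold.

```python
import numpy as np

LAMBDAS = [0.01, 0.1, 1.0, 10.0]  # hypothetical grid of ridge penalties

def kfold_indices(n, k, rng):
    """Yield (train_idx, test_idx) pairs for a shuffled k-fold split."""
    folds = np.array_split(rng.permutation(n), k)
    for i in range(k):
        train = np.concatenate([f for j, f in enumerate(folds) if j != i])
        yield train, folds[i]

def ridge_fit(X, y, lam):
    """Closed-form ridge regression weights."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

def mse(w, X, y):
    return float(np.mean((X @ w - y) ** 2))

def fit_model(X, y, rng, inner_k=5):
    """Tune lambda by inner cross-validation on (X, y), then refit on all of (X, y)."""
    splits = list(kfold_indices(len(y), inner_k, rng))  # same splits for every lambda

    def cv_error(lam):
        return np.mean([mse(ridge_fit(X[tr], y[tr], lam), X[te], y[te])
                        for tr, te in splits])

    best_lam = min(LAMBDAS, key=cv_error)
    return ridge_fit(X, y, best_lam), best_lam

# Synthetic data, for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([1.0, -2.0, 0.0, 0.5, 0.0]) + rng.normal(scale=0.5, size=100)

# Step 1: the model to deploy is fitted on all available data.
final_w, final_lam = fit_model(X, y, rng)

# Step 2: outer cross-validation of the whole fitting procedure.
outer_mse, chosen_lams = [], []
for tr, te in kfold_indices(len(y), 5, rng):
    w, lam = fit_model(X[tr], y[tr], rng)  # tuning sees only the outer training set
    outer_mse.append(mse(w, X[te], y[te]))
    chosen_lams.append(lam)                # recorded for the analysis in Step 3

print(f"estimated MSE: {np.mean(outer_mse):.3f} (spread {np.std(outer_mse):.3f})")
print("lambda chosen in each outer fold:", chosen_lams)
```

The inner cross-validation error that `fit_model` minimises is never reported as the performance estimate. Only the outer test folds, which played no part in the tuning, are used for that.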