Machine Learning – Why K-Fold Cross Validation is Better than K-Times Resampling

cross-validation, machine-learning

I'm currently working through a machine learning textbook and just read a bit about k-fold cross validation, and I am wondering the following. I want to estimate a parameter, e.g. a penalty parameter for a penalized likelihood method. In order to do this, I can do two different things:

  1. I split the training data into $k$ equally large folds, and for each fold I use the other $k-1$ folds as training data to get predictions of $y$, which I compare with the actual $y$ values in the held-out fold. I do this for every interesting choice of my parameter, and choose the parameter with the smallest error, averaged over all folds and all members of each fold.

  2. I split the training data into 2 equally large sets, one of which I use as training data while measuring the prediction error on the other set. For every interesting $\lambda$, I note the average error. Then I re-sample the data into 2 (different) equally large sets and repeat the above procedure. I sample $k$ times in total, and average over these repetitions to get an estimate of the best parameter. (A code sketch of both procedures follows this list.)
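To make the two procedures concrete, here is a minimal sketch using scikit-learn, with ridge regression standing in for the penalized likelihood method and its `alpha` playing the role of $\lambda$; the synthetic dataset, the grid of penalty values, and $k = 10$ are illustrative assumptions rather than anything taken from the question.

```python
# Minimal sketch of the two parameter-selection procedures described above.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold, ShuffleSplit

X, y = make_regression(n_samples=200, n_features=20, noise=10.0, random_state=0)
alphas = np.logspace(-3, 3, 13)   # candidate penalty parameters
k = 10

def mean_cv_error(splitter):
    """Average held-out error over the splits produced by `splitter`, per alpha."""
    errors = {alpha: [] for alpha in alphas}
    for train_idx, test_idx in splitter.split(X):
        for alpha in alphas:
            model = Ridge(alpha=alpha).fit(X[train_idx], y[train_idx])
            pred = model.predict(X[test_idx])
            errors[alpha].append(mean_squared_error(y[test_idx], pred))
    return {alpha: np.mean(errs) for alpha, errs in errors.items()}

# Approach 1: k-fold cross-validation (each observation is tested exactly once).
errs_kfold = mean_cv_error(KFold(n_splits=k, shuffle=True, random_state=0))

# Approach 2: k independent 50/50 resamples of the data.
errs_split = mean_cv_error(ShuffleSplit(n_splits=k, test_size=0.5, random_state=0))

print("k-fold choice:        alpha =", min(errs_kfold, key=errs_kfold.get))
print("50/50 resample choice: alpha =", min(errs_split, key=errs_split.get))
```

The only difference between the two calls is the splitter: `KFold` visits each observation exactly once as test data, whereas `ShuffleSplit` draws $k$ independent 50/50 partitions, so the training sets are half as large.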

The second approach looks rather naive, and I am wondering if there is something wrong with it. Are there reasons, generally speaking, why one would prefer method 1 over method 2? Are there computational reasons, or even statistical ones?

Best Answer

The problem with the second approach is that the training set is smaller (half of the available data) than for the cross-validation approach ($(k-1)/k$ of the available data). As most learning algorithms perform better the more data they are trained on, this means that the second approach gives a more pessimistically biased estimate of the performance of a model trained on all of the available data than the cross-validation-based approach does. Taken to its extreme, where $k$ is the size of the available dataset (i.e. leave-one-out cross-validation), this gives an almost unbiased estimate of generalisation performance.
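One way to see this bias, as a rough sketch rather than anything from the answer itself: on a synthetic task we can measure the "true" performance of a model trained on all $n$ available points using a large independent test set, and compare it with the estimates obtained from 50/50 splitting (training on half the data) and from 10-fold cross-validation (training on 9/10 of the data). The dataset sizes, model, and metric below are arbitrary choices for illustration.

```python
# Illustration of the pessimistic bias of small training sets (assumed setup).
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, ShuffleSplit, cross_val_score

X_all, y_all = make_regression(n_samples=10_200, n_features=20, noise=10.0,
                               random_state=1)
X, y = X_all[:200], y_all[:200]            # the "available" data (n = 200)
X_big, y_big = X_all[200:], y_all[200:]    # large independent test set

model = Ridge(alpha=1.0)

# "True" generalisation performance of a model trained on all available data.
true_score = model.fit(X, y).score(X_big, y_big)

# Estimate from 50/50 splits (train on half the data, as in approach 2).
half = cross_val_score(model, X, y,
                       cv=ShuffleSplit(n_splits=20, test_size=0.5, random_state=0))

# Estimate from 10-fold CV (train on 90% of the data in each fold).
tenfold = cross_val_score(model, X, y,
                          cv=KFold(n_splits=10, shuffle=True, random_state=0))

print(f"true R^2 (trained on all data): {true_score:.3f}")
print(f"50/50 split estimate:           {half.mean():.3f}")
print(f"10-fold CV estimate:            {tenfold.mean():.3f}")
```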

However, as well as bias (whether the estimate is systematically wrong), there is also variance (how much the estimate varies depending on the particular sample of data over which it is calculated). Using more data for training also reduces the variability of the resulting model's performance, but it leaves less test data, so the variance of the performance estimate increases. This means there is usually a compromise between bias and variance in deciding how much data to use for training and how much for testing in each fold; in practice, leave-one-out cross-validation isn't optimal because, while it is almost unbiased, it has a high variance, so the estimator has a higher overall error.
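If one wanted to look at this trade-off empirically, a rough simulation along the following lines could be run: draw many datasets from the same (made-up) generating process, compute a performance estimate on each with 2-fold, 10-fold, and leave-one-out CV, and compare the mean and spread of those estimates across datasets. Whether leave-one-out actually shows the larger spread depends on the learner and the data, so this is only a way of inspecting the behaviour, not a demonstration of it; the process, sample size, and model are all assumed for illustration.

```python
# Rough simulation of the bias/variance trade-off of CV estimates (assumed setup).
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, LeaveOneOut, cross_val_score

def one_estimate(cv, seed, n=60):
    """MSE estimate from one dataset drawn from the synthetic process."""
    X, y = make_regression(n_samples=n, n_features=10, noise=15.0,
                           random_state=seed)
    scores = cross_val_score(Ridge(alpha=1.0), X, y, cv=cv,
                             scoring="neg_mean_squared_error")
    return -scores.mean()

for name, cv in [("2-fold", KFold(2, shuffle=True, random_state=0)),
                 ("10-fold", KFold(10, shuffle=True, random_state=0)),
                 ("leave-one-out", LeaveOneOut())]:
    estimates = [one_estimate(cv, seed) for seed in range(50)]
    print(f"{name:>14}: mean MSE = {np.mean(estimates):7.1f}, "
          f"std across datasets = {np.std(estimates):6.1f}")
```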

The more repetitions of the resampling procedure we use, the more we can reduce the variance of the estimator. With split sampling that is easy: just increase the number of random splits. For cross-validation, we can perform cross-validation repeatedly, choosing a different partition of the data into $k$ disjoint subsets each time, and average the results. I often perform 100 random test-training splits (i.e. the second approach), but use a 90%/10% split between training and test data to reduce the bias of the estimator.
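The 100 random 90%/10% test-training splits mentioned above map naturally onto scikit-learn's `ShuffleSplit`, and repeated k-fold cross-validation onto `RepeatedKFold`; the ridge model and synthetic data in this sketch are placeholders, since the answer does not specify either.

```python
# 100 random 90/10 splits vs. repeated 10-fold CV (placeholder model and data).
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import RepeatedKFold, ShuffleSplit, cross_val_score

X, y = make_regression(n_samples=200, n_features=20, noise=10.0, random_state=0)
model = Ridge(alpha=1.0)

# 100 random 90%/10% splits (the second approach with a less biased split ratio).
splits_90_10 = ShuffleSplit(n_splits=100, test_size=0.1, random_state=0)
scores_split = cross_val_score(model, X, y, cv=splits_90_10)

# 10-fold CV repeated 10 times, with a different partition on each repeat.
repeated = RepeatedKFold(n_splits=10, n_repeats=10, random_state=0)
scores_rep = cross_val_score(model, X, y, cv=repeated)

print(f"100 x 90/10 splits : {scores_split.mean():.3f} +/- {scores_split.std():.3f}")
print(f"10 x 10-fold CV    : {scores_rep.mean():.3f} +/- {scores_rep.std():.3f}")
```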