Solved – Nested cross validation vs repeated k-fold

cross-validation

I know there are many topics (1, 2, 3), papers (1, 2, 3) and websites (1) that discuss this topic at length. However, for the past two days I have been reading all I can find about the subject, and it seems I have hit a brick wall in terms of progress.

Even so, another view on the subject would be very much appreciated.

From what I understand, we use nested cross-validation when we have several models, each of them with some number of hyper-parameters to tune. Here is an example where, as far as I can tell, nested cross-validation would be used:

Let's say we have a data set composed of 100 observations and we want to see which of the following models (with hyper-parameters in parentheses) is the best one:

- Neural network (number of hidden layers, number of neurons in each hidden layer, activation function)

- KNN (number of neighbors, distance measure)

- SVM (type of kernel function, kernel function parameters)

As the number of observations is small, we can't afford to separate three disjoint sets (training, validation, test), because we would probably end up with a worse model, since it would be trained with less data than is available. To address that, we implement a cross-validation strategy. (Is this correct so far?)

In the inner loop of the nested cross-validation, we would choose the best hyper-parameter combination for each model. After that, we would train each model on the combined data from the inner loop, and then compare them in the outer-loop cross-validation. The model with the smallest error measure would be considered the best. This model would then be trained on the whole data set and used for future predictions.
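To make that concrete, here is a minimal sketch of such a nested scheme with scikit-learn. The candidate models, their grids, the fold counts and the synthetic data are all illustrative assumptions on my part, not something prescribed by the procedure itself:

```python
# Minimal sketch of nested cross-validation (illustrative grids and fold counts).
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

X, y = make_classification(n_samples=100, random_state=0)  # stand-in for the 100 observations

candidates = {
    "neural net": (MLPClassifier(max_iter=2000),
                   {"hidden_layer_sizes": [(5,), (10,), (10, 10)],
                    "activation": ["relu", "tanh"]}),
    "knn": (KNeighborsClassifier(),
            {"n_neighbors": [1, 3, 5, 7], "metric": ["euclidean", "manhattan"]}),
    "svm": (SVC(),
            {"kernel": ["rbf", "poly"], "C": [0.1, 1, 10], "gamma": ["scale", 0.1]}),
}

inner_cv = KFold(n_splits=5, shuffle=True, random_state=0)  # hyper-parameter tuning
outer_cv = KFold(n_splits=5, shuffle=True, random_state=1)  # performance estimation

outer_scores = {}
for name, (estimator, grid) in candidates.items():
    # GridSearchCV is the inner loop; cross_val_score wraps it in the outer loop,
    # so the whole tune-then-fit procedure is scored on data it never saw during tuning.
    tuned = GridSearchCV(estimator, grid, cv=inner_cv)
    outer_scores[name] = cross_val_score(tuned, X, y, cv=outer_cv).mean()

best_name = max(outer_scores, key=outer_scores.get)  # highest accuracy = smallest error
# Finally, rerun the winning procedure (including its tuning step) on the whole data set.
final_model = GridSearchCV(*candidates[best_name], cv=inner_cv).fit(X, y)
```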

My main doubt is: why can't I do repeated cross-validation to get the same results? As an example, let's say I used repeated k-fold. In that case I would train every model and hyper-parameter combination on the training set and evaluate its performance on the test set for each k-fold split. I would then choose the model and hyper-parameters that gave the smallest mean error across all repetitions. Finally, the best model would be fitted to the whole data set, similar to the nested cross-validation example.
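For comparison, here is a minimal sketch of that "flat" repeated k-fold scheme. To keep it short it uses only two candidate models, and the grids, fold/repeat counts and synthetic data are again my own assumptions:

```python
# Minimal sketch of "flat" model selection with repeated k-fold (no nesting).
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, RepeatedKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

X, y = make_classification(n_samples=100, random_state=0)
rkf = RepeatedKFold(n_splits=5, n_repeats=10, random_state=0)

candidates = {
    "knn": (KNeighborsClassifier(), {"n_neighbors": [1, 3, 5, 7]}),
    "svm": (SVC(), {"C": [0.1, 1, 10], "gamma": ["scale", 0.1]}),
}

flat_scores = {}
for name, (estimator, grid) in candidates.items():
    # The same repeated splits both tune the hyper-parameters and score the model.
    search = GridSearchCV(estimator, grid, cv=rkf).fit(X, y)
    flat_scores[name] = search.best_score_  # mean accuracy over all folds and repeats

best_name = max(flat_scores, key=flat_scores.get)
final_model = GridSearchCV(*candidates[best_name], cv=rkf).fit(X, y)  # refit on all data
# Caveat: flat_scores[best_name] is an optimistically biased estimate of generalisation
# performance, because the data used to select the model also produced the score.
```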

As I understand it, the estimated error would be biased in the repeated cross-validation example. This happens because there is some leakage of information, since we are using the same data for model selection and hyper-parameter tuning as for model assessment. However, if I am only interested in choosing the best model, is nested cross-validation really necessary?

Feel free to correct any wrong assumptions or improper use of terminology, as I am fairly new to this field, and any help would be greatly appreciated.

Best Answer

Nested cross-validation and repeated k-fold cross-validation have different aims. The aim of nested cross-validation is to eliminate the bias in the performance estimate due to the use of cross-validation to tune the hyper-parameters. As the "inner" cross-validation has been directly optimised to tune the hyper-parameters, it will give an optimistically biased estimate of generalisation performance. The aim of repeated k-fold cross-validation, on the other hand, is to reduce the variance of the performance estimate (to average out the random variation caused by partitioning the data into folds). If you want to reduce bias and variance, there is no reason (other than computational expense) not to combine both, such that repeated k-fold is used for the "outer" cross-validation of a nested cross-validation estimate. Using repeated k-fold cross-validation for the "inner" folds might also improve the hyper-parameter tuning.
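As a rough illustration of that combination (one candidate model only; the fold and repeat counts are my own assumptions), repeated k-fold can simply be plugged in as the outer splitter of a nested estimate:

```python
# Minimal sketch: repeated k-fold as the outer loop of a nested cross-validation.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, KFold, RepeatedKFold, cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=100, random_state=0)

inner_cv = KFold(n_splits=5, shuffle=True, random_state=0)          # tunes the hyper-parameters
outer_cv = RepeatedKFold(n_splits=5, n_repeats=10, random_state=1)  # averages out partitioning noise

tuned_svm = GridSearchCV(SVC(), {"C": [0.1, 1, 10], "gamma": ["scale", 0.1]}, cv=inner_cv)
scores = cross_val_score(tuned_svm, X, y, cv=outer_cv)
print(scores.mean(), scores.std())  # low-bias estimate whose variance is reduced by the repeats
```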

If all of the models have only a small number of hyper-parameters (and they are not overly sensitive to the hyper-parameter values), then you can often get away with a non-nested cross-validation to choose the model, and only need nested cross-validation if you need an unbiased performance estimate; see:

Jacques Wainer and Gavin Cawley, "Nested cross-validation when selecting classifiers is overzealous for most practical applications", Expert Systems with Applications, Volume 182, 2021 (doi, pdf)

If, on the other hand, some models have more hyper-parameters than others, the model choice will be biased towards the models with the most hyper-parameters (which is probably a bad thing, as they are the ones most likely to experience over-fitting in model selection). See the comparison of RBF kernels with a single hyper-parameter and Automatic Relevance Determination (ARD) kernels, with one hyper-parameter for each attribute, in section 4.3 of my paper (with Mrs Marsupial):

GC Cawley and NLC Talbot, "On over-fitting in model selection and subsequent selection bias in performance evaluation", The Journal of Machine Learning Research 11, 2079-2107, 2010 (pdf)

The PRESS statistic (which is the inner cross-validation) will almost always select the ARD kernel, despite the RBF kernel giving better generalisation performance in the majority of cases (ten of the thirteen benchmark datasets).