Cross Validation – When Is Nested Cross-Validation Practically Needed?

Tags: cross-validation, model selection, ridge regression

When using cross-validation both to do model selection (e.g. hyperparameter tuning) and to assess the performance of the best model, one should use nested cross-validation. The outer loop assesses the performance of the model, and the inner loop selects the best model; the model is selected on each outer-training set (using the inner CV loop) and its performance is measured on the corresponding outer-testing set.
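Schematically, this nested structure can be sketched with scikit-learn (my choice of library for illustration; the synthetic data, the ridge estimator, the $\lambda$ grid, and the split scheme below are just placeholders, not my actual setup):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, ShuffleSplit, cross_val_score

# Placeholder data; substitute your own X, y.
X, y = make_regression(n_samples=200, n_features=20, noise=10.0, random_state=0)

lambda_grid = {"alpha": np.logspace(-4, 2, 25)}
inner_cv = ShuffleSplit(n_splits=50, test_size=0.1, random_state=1)
outer_cv = ShuffleSplit(n_splits=50, test_size=0.1, random_state=2)

# Inner loop: select the regularisation strength on each outer-training set.
inner_search = GridSearchCV(Ridge(), lambda_grid, cv=inner_cv,
                            scoring="neg_mean_squared_error")

# Outer loop: assess the selected model on the corresponding outer-testing set.
nested_scores = cross_val_score(inner_search, X, y, cv=outer_cv,
                                scoring="neg_mean_squared_error")
print("nested CV error:", -nested_scores.mean())
```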

This has been discussed and explained in many threads (see e.g. Training with the full dataset after cross-validation? and the answer by @DikranMarsupial there) and is entirely clear to me. Doing only a simple (non-nested) cross-validation for both model selection and performance estimation can yield a positively biased performance estimate. @DikranMarsupial has a 2010 paper on exactly this topic (On Over-fitting in Model Selection and Subsequent Selection Bias in Performance Evaluation), with Section 4.3 titled Is Over-fitting in Model Selection Really a Genuine Concern in Practice?, and the paper shows that the answer is Yes.

All of that being said, I am now working with multivariate multiple ridge regression and I don't see any difference between simple and nested CV, and so nested CV in this particular case looks like an unnecessary computational burden. My question is: under what conditions will simple CV yield a noticeable bias that is avoided with nested CV? When does nested CV matter in practice, and when does it not matter that much? Are there any rules of thumb?

Here is an illustration using my actual dataset. Horizontal axis is $\log(\lambda)$ for ridge regression. Vertical axis is cross-validation error. Blue line corresponds to the simple (non-nested) cross-validation, with 50 random 90:10 training/test splits. Red line corresponds to the nested cross-validation with 50 random 90:10 training/test splits, where $\lambda$ is chosen with an inner cross-validation loop (also 50 random 90:10 splits). Lines are means over 50 random splits, shadings show $\pm1$ standard deviation.

Simple vs nested cross-validation

The red line is flat because $\lambda$ is selected in the inner loop, so the nested-CV error does not depend on the $\lambda$ shown on the horizontal axis. If simple cross-validation were biased, then the minimum of the blue curve would lie below the red line. But this is not the case.
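For reference, here is a sketch of how such a simple-CV (blue) curve can be computed, again with scikit-learn and placeholder synthetic data rather than my actual dataset and code:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import ShuffleSplit, validation_curve

# Placeholder data; the real analysis uses my own dataset.
X, y = make_regression(n_samples=200, n_features=20, noise=10.0, random_state=0)

lambdas = np.logspace(-4, 2, 25)
splits = ShuffleSplit(n_splits=50, test_size=0.1, random_state=2)

# Simple (non-nested) CV: evaluate every lambda on every split.
_, test_scores = validation_curve(Ridge(), X, y, param_name="alpha",
                                  param_range=lambdas, cv=splits,
                                  scoring="neg_mean_squared_error")

blue_mean = -test_scores.mean(axis=1)   # blue line: mean error per lambda
blue_sd = test_scores.std(axis=1)       # shading: +/- 1 standard deviation
print("minimum of the blue curve:",
      lambdas[blue_mean.argmin()], blue_mean.min())
```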

Update

It actually is the case 🙂 It is just that the difference is tiny. Here is the zoom-in:

Simple vs nested cross-validation, zoom-in

One potentially misleading thing here is that my error bars (shadings) are huge, but the nested and the simple CV can be (and were) conducted on the same training/test splits, so the comparison between them is paired, as hinted by @Dikran in the comments. So let's take the difference between the nested CV error and the simple CV error (at $\lambda=0.002$, the value corresponding to the minimum of my blue curve); on each fold, these two errors are computed on the same testing set. Plotting this difference across the $50$ training/test splits, I get the following:

Simple vs nested cross-validation, differences

Zeros correspond to splits where the inner CV loop also yielded $\lambda=0.002$ (this happens almost half the time). On average, the difference tends to be positive, i.e. nested CV has a slightly higher error. In other words, simple CV shows a minuscule but optimistic bias.

(I ran the whole procedure a couple of times, and it happens every time.)
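A sketch of this paired computation (scikit-learn with placeholder data; $\lambda=0.002$ is the value taken from my blue curve, not something derived from the placeholder data):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, ShuffleSplit

# Placeholder data; the real analysis uses my own dataset.
X, y = make_regression(n_samples=200, n_features=20, noise=10.0, random_state=0)

lambda_grid = {"alpha": np.logspace(-4, 2, 25)}
inner_cv = ShuffleSplit(n_splits=50, test_size=0.1, random_state=1)
outer_cv = ShuffleSplit(n_splits=50, test_size=0.1, random_state=2)
fixed_lambda = 0.002   # minimum of the simple-CV (blue) curve on my data

diffs = []
for train_idx, test_idx in outer_cv.split(X):
    X_tr, X_te = X[train_idx], X[test_idx]
    y_tr, y_te = y[train_idx], y[test_idx]

    # Simple CV: fixed lambda, evaluated on this split's test set.
    simple_err = mean_squared_error(
        y_te, Ridge(alpha=fixed_lambda).fit(X_tr, y_tr).predict(X_te))

    # Nested CV: lambda chosen by the inner loop on this split's training set,
    # evaluated on the *same* test set.
    search = GridSearchCV(Ridge(), lambda_grid, cv=inner_cv,
                          scoring="neg_mean_squared_error").fit(X_tr, y_tr)
    nested_err = mean_squared_error(y_te, search.predict(X_te))

    # The difference is zero on splits where the inner loop picks the same lambda.
    diffs.append(nested_err - simple_err)

print("mean difference (nested - simple):", np.mean(diffs))
```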

My question is, under what conditions can we expect this bias to be minuscule, and under what conditions should we not?

Best Answer

I would suggest that the bias depends on the variance of the model selection criterion: the higher the variance, the larger the bias is likely to be. The variance of the model selection criterion has two principal sources: the size of the dataset on which it is evaluated (so the smaller the dataset, the larger the bias is likely to be) and the stability of the statistical model (if the model parameters are well estimated by the available training data, there is less flexibility for the model to over-fit the model selection criterion by tuning the hyper-parameters). The other relevant factor is the number of model choices to be made and/or hyper-parameters to be tuned.

In my study, I am looking at powerful non-linear models and relatively small datasets (commonly used in machine learning studies), and both of these factors mean that nested cross-validation is absolutely necessary. If you increase the number of hyper-parameters (perhaps having a kernel with a scaling parameter for each attribute), the over-fitting can be "catastrophic". If you are using linear models with only a single regularisation parameter and a relatively large number of cases (relative to the number of parameters), then the difference is likely to be much smaller.
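A quick way to gauge the size of this effect on a particular problem is to compare the non-nested estimate (the best inner-CV score, which is also used to pick the hyper-parameter) with the nested estimate; here is a minimal sketch with scikit-learn on synthetic placeholder data:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score

# Synthetic placeholder data; substitute the dataset of interest.
X, y = make_regression(n_samples=200, n_features=20, noise=10.0, random_state=0)

grid = {"alpha": np.logspace(-4, 2, 25)}
inner_cv = KFold(n_splits=10, shuffle=True, random_state=1)
outer_cv = KFold(n_splits=10, shuffle=True, random_state=2)

search = GridSearchCV(Ridge(), grid, cv=inner_cv,
                      scoring="neg_mean_squared_error")

# Non-nested: the best inner-CV score, i.e. the score used for selection.
non_nested = -search.fit(X, y).best_score_

# Nested: the same selection procedure assessed on held-out outer folds.
nested = -cross_val_score(search, X, y, cv=outer_cv,
                          scoring="neg_mean_squared_error").mean()

print("non-nested:", non_nested, "nested:", nested)  # gap ~ selection bias
```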

I should add that I would recommend always using nested cross-validation, provided it is computationally feasible, as it eliminates a possible source of bias so that we (and the peer-reviewers ;o) don't need to worry about whether it is negligible or not.