Solved – Inner loop overfitting in nested cross-validation

Tags: cross-validation, model-selection

I have implemented nested cross-validation in MATLAB for a classification problem with 56 features and 408 cases.
I perform feature and model selection in the inner cross-validation loop, using 10-fold CV for both the inner and outer loops.

In the inner CV loop I run a sequential forward feature selection procedure nested inside a grid search, so that each candidate feature set is paired with the optimal regularization parameters for a regularized discriminant classifier.
I am seeing very poor performance on the outer CV loop (compared with a standard quadratic discriminant classifier tuned with a single CV loop), which leads me to conclude that I am overfitting in the inner CV loop.
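For concreteness, the nested structure described above can be sketched in Python (not the original MATLAB code). The synthetic data, the toy shrunken-centroid classifier standing in for regularized discriminant analysis, and all function names are illustrative assumptions, not the actual pipeline:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the 408-case, 56-feature data set (assumption).
X = rng.normal(size=(408, 56))
y = (X[:, 0] + 0.5 * rng.normal(size=408) > 0).astype(int)

def fit_predict(X_tr, y_tr, X_te, shrink):
    # Toy shrunken-centroid classifier: a stand-in for RDA, where
    # `shrink` plays the role of the regularization parameter.
    mu = np.array([X_tr[y_tr == c].mean(axis=0) for c in (0, 1)])
    mu = mu * (1.0 - shrink)  # shrink class means toward the origin
    d = ((X_te[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
    return d.argmin(axis=1)

def kfold_indices(n, k, rng):
    # Random partition of 0..n-1 into k roughly equal folds.
    return np.array_split(rng.permutation(n), k)

def nested_cv(X, y, grid, k=10, seed=0):
    rng = np.random.default_rng(seed)
    outer_acc = []
    outer = kfold_indices(len(y), k, rng)
    for i, test_idx in enumerate(outer):
        train_idx = np.concatenate([f for j, f in enumerate(outer) if j != i])
        # Inner CV: choose the hyperparameter by cross-validated accuracy
        # computed ONLY on the outer training portion.
        inner = kfold_indices(len(train_idx), k, rng)
        best_s, best_acc = grid[0], -1.0
        for s in grid:
            accs = []
            for m, val_f in enumerate(inner):
                tr = np.concatenate([f for j2, f in enumerate(inner) if j2 != m])
                tr_i, val_i = train_idx[tr], train_idx[val_f]
                pred = fit_predict(X[tr_i], y[tr_i], X[val_i], s)
                accs.append((pred == y[val_i]).mean())
            if np.mean(accs) > best_acc:
                best_s, best_acc = s, float(np.mean(accs))
        # Refit on the full outer training part with the chosen value,
        # then evaluate on the untouched outer test fold.
        pred = fit_predict(X[train_idx], y[train_idx], X[test_idx], best_s)
        outer_acc.append((pred == y[test_idx]).mean())
    return float(np.mean(outer_acc))

acc = nested_cv(X, y, grid=[0.0, 0.5, 0.9])
```

The key point of the structure is that the outer test fold never influences the inner selection; any leakage of outer-fold data into the inner loop would produce exactly the optimistic bias discussed below.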

I have read papers by Cawley and Talbot that show that biased model selection protocols favour worse models, and furthermore, that the inner loop procedure alone produces a biased performance estimate.

Does this explain the overfitting observed for the inner loop in nested CV?

Are there any practical strategies to avoid this overfitting?

Best Answer

Overfitting in model selection for classification is usually caused by including too many parameters. Cross-validation or bootstrap error-rate estimation should avoid this problem, because both avoid the optimism of an estimate like resubstitution, which tests the classifier on the same data used to fit it.

If you minimize the cross-validated estimate of the error rate in your inner loop as the criterion for variable selection, you should not have this problem. Am I correct in assuming that your selection procedure does not do this? If so, you are probably using a criterion that is biased toward models with many parameters, and such models may be poor models for prediction.
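To illustrate the optimism of resubstitution, here is a small Python sketch on pure-noise data (everything here is an illustrative assumption: synthetic data, a least-squares linear classifier, and the function names). Training (resubstitution) accuracy tends to rise as irrelevant features are added, while the cross-validated accuracy stays near chance, which is why resubstitution-based selection drifts toward over-large models:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 56))      # pure-noise features (synthetic demo)
y = rng.integers(0, 2, size=100)    # labels independent of X

def fit(X_tr, y_tr):
    # Least-squares linear classifier on +/-1 targets.
    A = np.hstack([X_tr, np.ones((len(X_tr), 1))])
    w, *_ = np.linalg.lstsq(A, 2.0 * y_tr - 1.0, rcond=None)
    return w

def acc(w, X_te, y_te):
    A = np.hstack([X_te, np.ones((len(X_te), 1))])
    return ((A @ w > 0).astype(int) == y_te).mean()

def resub_acc(k):
    # Resubstitution: train and test on the SAME data, first k features.
    w = fit(X[:, :k], y)
    return acc(w, X[:, :k], y)

def cv_acc(k, folds=5):
    # Cross-validated accuracy with the first k features.
    idx = np.array_split(np.arange(len(y)), folds)
    scores = []
    for i, te in enumerate(idx):
        tr = np.concatenate([f for j, f in enumerate(idx) if j != i])
        w = fit(X[tr][:, :k], y[tr])
        scores.append(acc(w, X[te][:, :k], y[te]))
    return float(np.mean(scores))

# With all 56 noise features, resubstitution accuracy is well above the
# cross-validated accuracy, even though no feature carries any signal.
r56, c56 = resub_acc(56), cv_acc(56)
```

A selection criterion based on `resub_acc` would keep adding noise features; one based on `cv_acc` would not, which is the fix suggested above.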