Solved – Inner loop overfitting in nested cross-validation

Tags: cross-validation, model-selection

I have implemented nested cross-validation in MATLAB for a classification problem with 56 features and 408 cases.
I perform feature and model selection in the inner cross-validation loop, using 10-fold CV for both the inner and outer loops.

In the inner CV loop I run a sequential forward feature selection procedure nested inside a grid search, so that each candidate feature set is paired with the optimal regularization parameters for a regularized discriminant classifier.
I am seeing very poor performance on the outer CV loop (compared with a standard quadratic discriminant classifier tuned with a single CV loop), which leads me to conclude that I am overfitting in the inner CV loop.
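For concreteness, the nested structure described above can be sketched in Python (not the original MATLAB code). The synthetic data, the toy shrunken-centroid classifier standing in for regularized discriminant analysis, and all function names are illustrative assumptions, not the actual pipeline:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the 408-case, 56-feature data set (assumption).
X = rng.normal(size=(408, 56))
y = (X[:, 0] + 0.5 * rng.normal(size=408) > 0).astype(int)

def fit_predict(X_tr, y_tr, X_te, shrink):
    # Toy shrunken-centroid classifier: a stand-in for RDA, where
    # `shrink` plays the role of the regularization parameter.
    mu = np.array([X_tr[y_tr == c].mean(axis=0) for c in (0, 1)])
    mu = mu * (1.0 - shrink)  # shrink class means toward the origin
    d = ((X_te[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
    return d.argmin(axis=1)

def kfold_indices(n, k, rng):
    # Random partition of 0..n-1 into k roughly equal folds.
    return np.array_split(rng.permutation(n), k)

def nested_cv(X, y, grid, k=10, seed=0):
    rng = np.random.default_rng(seed)
    outer_acc = []
    outer = kfold_indices(len(y), k, rng)
    for i, test_idx in enumerate(outer):
        train_idx = np.concatenate([f for j, f in enumerate(outer) if j != i])
        # Inner CV: choose the hyperparameter by cross-validated accuracy
        # computed ONLY on the outer training portion.
        inner = kfold_indices(len(train_idx), k, rng)
        best_s, best_acc = grid[0], -1.0
        for s in grid:
            accs = []
            for m, val_f in enumerate(inner):
                tr = np.concatenate([f for j2, f in enumerate(inner) if j2 != m])
                tr_i, val_i = train_idx[tr], train_idx[val_f]
                pred = fit_predict(X[tr_i], y[tr_i], X[val_i], s)
                accs.append((pred == y[val_i]).mean())
            if np.mean(accs) > best_acc:
                best_s, best_acc = s, float(np.mean(accs))
        # Refit on the full outer training part with the chosen value,
        # then evaluate on the untouched outer test fold.
        pred = fit_predict(X[train_idx], y[train_idx], X[test_idx], best_s)
        outer_acc.append((pred == y[test_idx]).mean())
    return float(np.mean(outer_acc))

acc = nested_cv(X, y, grid=[0.0, 0.5, 0.9])
```

The key point of the structure is that the outer test fold never influences the inner selection; any leakage of outer-fold data into the inner loop would produce exactly the optimistic bias discussed below.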

I have read papers by Cawley and Talbot that show that biased model selection protocols favour worse models, and furthermore, that the inner loop procedure alone produces a biased performance estimate.

Does this explain the overfitting observed for the inner loop in nested CV?

Are there any practical strategies to avoid this overfitting?

Best Answer

Overfitting in model selection for classification is usually caused by including too many parameters. Cross-validation or bootstrap error-rate estimation should avoid this problem, because both avoid the optimism of an estimate like resubstitution, which tests the classifier on the same data used to fit it.

If you minimize the cross-validated estimate of the error rate in your inner loop as the criterion for variable selection, you should not have this problem. Am I correct in assuming that your selection procedure does not do this? If so, you are probably using a criterion that is biased toward models with many parameters, and such models may be poor models for prediction.
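To illustrate the optimism of resubstitution, here is a small Python sketch on pure-noise data (everything here is an illustrative assumption: synthetic data, a least-squares linear classifier, and the function names). Training (resubstitution) accuracy tends to rise as irrelevant features are added, while the cross-validated accuracy stays near chance, which is why resubstitution-based selection drifts toward over-large models:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 56))      # pure-noise features (synthetic demo)
y = rng.integers(0, 2, size=100)    # labels independent of X

def fit(X_tr, y_tr):
    # Least-squares linear classifier on +/-1 targets.
    A = np.hstack([X_tr, np.ones((len(X_tr), 1))])
    w, *_ = np.linalg.lstsq(A, 2.0 * y_tr - 1.0, rcond=None)
    return w

def acc(w, X_te, y_te):
    A = np.hstack([X_te, np.ones((len(X_te), 1))])
    return ((A @ w > 0).astype(int) == y_te).mean()

def resub_acc(k):
    # Resubstitution: train and test on the SAME data, first k features.
    w = fit(X[:, :k], y)
    return acc(w, X[:, :k], y)

def cv_acc(k, folds=5):
    # Cross-validated accuracy with the first k features.
    idx = np.array_split(np.arange(len(y)), folds)
    scores = []
    for i, te in enumerate(idx):
        tr = np.concatenate([f for j, f in enumerate(idx) if j != i])
        w = fit(X[tr][:, :k], y[tr])
        scores.append(acc(w, X[te][:, :k], y[te]))
    return float(np.mean(scores))

# With all 56 noise features, resubstitution accuracy is well above the
# cross-validated accuracy, even though no feature carries any signal.
r56, c56 = resub_acc(56), cv_acc(56)
```

A selection criterion based on `resub_acc` would keep adding noise features; one based on `cv_acc` would not, which is the fix suggested above.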