Machine Learning – Cross Validation Error Generalization After Model Selection

cross-validation, data mining, machine learning, model selection

Note: the case here is n >> p.

I am reading Elements of Statistical Learning, and there are various mentions of the "right" way to do cross-validation (e.g. page 60, page 245). Specifically, my question is how to evaluate the final model (without a separate test set) using k-fold CV or bootstrapping when there has been a model search. It seems that in most cases (ML algorithms without embedded feature selection) there will be

  1. A feature selection step
  2. A meta parameter selection step (e.g. the cost parameter in SVM).

My Questions:

  1. I have seen the feature selection step handled as follows: feature selection is performed on the whole training set and the chosen features are set aside. Then, using k-fold CV, the feature selection algorithm is run within each fold (possibly choosing different features each time) and the error is averaged. Finally, the features chosen on all the data (the ones set aside) are used to train the final model, but the error from the cross-validation is used as an estimate of the model's future performance. IS THIS CORRECT?
  2. When you are using cross-validation to select model parameters, how do you estimate model performance afterwards? IS IT THE SAME PROCESS AS #1 ABOVE, OR SHOULD YOU USE NESTED CV AS SHOWN ON PAGE 54 (pdf), OR SOMETHING ELSE?
  3. When you are doing both steps (feature selection and parameter tuning), then what do you do? Complex nested loops?
  4. If you have a separate hold-out sample, does the concern go away, so that you can use cross-validation to select features and parameters (without worry, since your performance estimate will come from the hold-out set)?

Best Answer

The key thing to remember is that, for cross-validation to give an (almost) unbiased performance estimate, every step involved in fitting the model must also be performed independently in each fold of the cross-validation procedure. The best thing to do is to view feature selection, meta/hyper-parameter setting and optimising the parameters as integral parts of model fitting, and never do any one of these steps without doing the other two.

The optimistic bias that can be introduced by departing from that recipe can be surprisingly large, as demonstrated by Cawley and Talbot, where the bias introduced by an apparently benign departure was larger than the difference in performance between competing classifiers. Worse still, biased protocols favour bad models most strongly, as such models are more sensitive to the tuning of hyper-parameters and hence more prone to over-fitting the model selection criterion!

Answers to specific questions:

The procedure in step 1 is valid because feature selection is performed separately in each fold, so what you are cross-validating is the whole procedure used to fit the final model. The cross-validation estimate will have a slightly pessimistic bias, as the dataset for each fold is slightly smaller than the whole dataset used for the final model.
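As a concrete illustration, here is a minimal sketch of that procedure in scikit-learn. The synthetic data, the univariate selector (SelectKBest) and the logistic regression classifier are just placeholders for whatever selector and learner you actually use; none of them come from the question.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Placeholder data: 500 samples, 50 features, 5 of which are informative.
X, y = make_classification(n_samples=500, n_features=50, n_informative=5,
                           random_state=0)

# Feature selection and classifier are bundled into one estimator, so the
# selector is refit independently inside every cross-validation fold.
pipe = Pipeline([
    ("select", SelectKBest(f_classif, k=10)),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Cross-validate the *whole* procedure (select features, then fit).
cv_scores = cross_val_score(pipe, X, y, cv=5)
print("CV accuracy: %.3f +/- %.3f" % (cv_scores.mean(), cv_scores.std()))

# The final model is fit on all the data with the same procedure; the CV
# score above is its (slightly pessimistic) performance estimate.
final_model = pipe.fit(X, y)
```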

For 2, as cross-validation is used to select the model parameters, you need to repeat that selection procedure independently in each fold of the cross-validation used for performance estimation, so you end up with nested cross-validation.
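A minimal sketch of nested cross-validation, again in scikit-learn and again with placeholder data and a placeholder model (an RBF SVM with cost parameter C, echoing the question): the inner CV tunes C, while the outer CV estimates the performance of "tune C, then fit" as a whole.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Inner loop: hyper-parameter selection by 5-fold cross-validation.
inner = GridSearchCV(SVC(kernel="rbf"),
                     param_grid={"C": [0.1, 1, 10, 100]},
                     cv=5)

# Outer loop: performance estimation. The whole GridSearchCV object is
# refit in every outer fold, so the tuning is repeated independently.
outer_scores = cross_val_score(inner, X, y, cv=5)
print("Nested CV accuracy: %.3f" % outer_scores.mean())

# The final model comes from running the inner selection on all the data.
final_model = inner.fit(X, y)
print("Chosen C:", final_model.best_params_["C"])
```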

For 3, essentially yes, you need to do nested-nested cross-validation. In each fold of the outermost cross-validation (used for performance estimation) you need to repeat everything you intend to do to fit the final model.
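In practice the two inner steps can often share one inner search, as in this sketch (same placeholder selector, data and SVM as above, not anything prescribed by the question): feature selection and hyper-parameter tuning are wrapped into a single search object, and the outer CV refits that whole object in every fold.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=50, n_informative=5,
                           random_state=0)

pipe = Pipeline([
    ("select", SelectKBest(f_classif)),
    ("clf", SVC(kernel="rbf")),
])

# Inner CV tunes the number of selected features and the SVM cost jointly.
search = GridSearchCV(pipe,
                      param_grid={"select__k": [5, 10, 20],
                                  "clf__C": [0.1, 1, 10]},
                      cv=5)

# Outer CV sees only the complete procedure (select + tune + fit), so the
# resulting score estimates the performance of the final model, which is
# obtained by running search.fit(X, y) on all the data.
outer_scores = cross_val_score(search, X, y, cv=5)
print("Nested CV accuracy: %.3f" % outer_scores.mean())
```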

For 4, yes: if you have a separate hold-out set, then that will give an unbiased estimate of performance without needing an additional cross-validation.
