Solved – Use of nested cross-validation

cross-validation, machine-learning, scikit-learn

Scikit-learn's page on Model Selection mentions the use of nested cross-validation:

>>> clf = GridSearchCV(estimator=svc, param_grid=dict(gamma=gammas),
...                    n_jobs=-1)
>>> cross_validation.cross_val_score(clf, X_digits, y_digits)

Two cross-validation loops are performed in parallel: one by the GridSearchCV estimator to set gamma and the other one by cross_val_score to measure the prediction performance of the estimator. The resulting scores are unbiased estimates of the prediction score on new data.
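To make the snippet self-contained, here is roughly what I am running. It is only a sketch: the SVC settings and the gamma grid are mine, not from the documentation, and newer scikit-learn releases expose cross_val_score from sklearn.model_selection rather than the sklearn.cross_validation module used in the quote.

# Minimal, self-contained version of the documentation snippet
# (gamma grid and SVC settings are illustrative only).
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC
import numpy as np

X_digits, y_digits = load_digits(return_X_y=True)

svc = SVC(kernel="rbf")
gammas = np.logspace(-6, -1, 6)  # illustrative grid

# Inner loop: GridSearchCV chooses gamma by cross-validation.
clf = GridSearchCV(estimator=svc, param_grid=dict(gamma=gammas), n_jobs=-1)

# Outer loop: cross_val_score estimates how well the whole
# tune-then-fit procedure generalises to data it has not seen.
scores = cross_val_score(clf, X_digits, y_digits)
print(scores.mean())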

From what I understand, clf.fit will use cross-validation natively to determine the best gamma. In that case, why would we need nested CV as shown above? The note says that nested CV produces "unbiased estimates" of the prediction score. Isn't that also the case with clf.fit?

Also, I was unable to retrieve the best parameter estimates found by clf from the cross_validation.cross_val_score(clf, X_digits, y_digits) call. Could you please advise how that can be done?

Best Answer

Nested cross-validation is used to avoid the optimistically biased estimates of performance that result from using the same cross-validation procedure both to set the values of the hyper-parameters of the model (e.g. the regularisation parameter, $C$, and the kernel parameters of an SVM) and to estimate performance. I wrote a paper on this topic after being rather alarmed by the magnitude of the bias introduced by a seemingly benign shortcut often used in the evaluation of kernel machines. I investigated the topic in order to discover why my results were worse than those of other research groups using similar methods on the same datasets; the reason turned out to be that I was using nested cross-validation and hence didn't benefit from the optimistic bias.

G. C. Cawley and N. L. C. Talbot, "Over-fitting in model selection and subsequent selection bias in performance evaluation", Journal of Machine Learning Research, vol. 11, pp. 2079-2107, July 2010. (http://jmlr.org/papers/volume11/cawley10a/cawley10a.pdf)

The reasons for the bias, with illustrative examples and an experimental evaluation, can be found in the paper, but essentially the point is this: if the performance evaluation criterion is used in any way to make choices about the model, then those choices are based partly on (i) genuine improvements in generalisation performance and partly on (ii) the statistical peculiarities of the particular sample of data on which the performance evaluation criterion is evaluated. In other words, the bias arises because it is possible (all too easy) to over-fit the cross-validation error when tuning the hyper-parameters, as the sketch below illustrates.
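To see the effect concretely, one can score the same tuned model both ways. The following sketch (assuming scikit-learn's model_selection API, with the iris dataset and a small SVM grid chosen purely for illustration) reports the best inner cross-validation score, which reuses the very folds that chose the hyper-parameters, alongside a nested estimate in which the outer test folds never influence that choice.

# Sketch: contrast the (optimistically biased) non-nested score with a
# nested cross-validation score. Dataset and grid are illustrative only.
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
param_grid = {"C": [1, 10, 100], "gamma": [0.01, 0.1]}

inner_cv = KFold(n_splits=4, shuffle=True, random_state=0)
outer_cv = KFold(n_splits=4, shuffle=True, random_state=1)

clf = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=inner_cv)

# Non-nested: the best inner CV score -- the same folds both choose the
# hyper-parameters and report the performance, hence the optimistic bias.
clf.fit(X, y)
non_nested = clf.best_score_

# Nested: the outer loop only scores models whose hyper-parameters were
# chosen without ever seeing the outer test fold.
nested = cross_val_score(clf, X, y, cv=outer_cv).mean()

print(f"non-nested CV score: {non_nested:.3f}")
print(f"nested CV score:     {nested:.3f}")

Typically the non-nested figure comes out at least slightly higher; how much higher depends on the dataset size and on how many hyper-parameter combinations are searched.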