Cross-validation – How to Apply Cross-Validation for Selecting SVM Parameters

Tags: cross-validation, svm

The wonderful libsvm package provides a Python interface and a script, easy.py, that automatically searches for the learning parameters (cost & gamma) that maximize the accuracy of the classifier. Within a given candidate set of learning parameters, accuracy is operationalized by cross-validation, but I feel like this undermines the purpose of cross-validation. That is, insofar as the learning parameters themselves can be chosen in a manner that might over-fit the data, it seems a more appropriate approach would be to apply cross-validation at the level of the search itself: perform the search on a training data set, and then evaluate the ultimate accuracy of the SVM resulting from the finally-chosen learning parameters on a separate testing data set. Or am I missing something here?
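To make the setup concrete, here is a minimal sketch of the kind of search easy.py performs, written with scikit-learn's SVC (which wraps libsvm) rather than the libsvm scripts themselves; the toy data and grid values are placeholders, not anything easy.py prescribes:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Toy data standing in for a real problem.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Grid over cost (C) and gamma, scored by cross-validated accuracy --
# the same idea as easy.py's grid search.
param_grid = {"C": 2.0 ** np.arange(-5, 16, 2),
              "gamma": 2.0 ** np.arange(-15, 4, 2)}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X, y)

# This CV accuracy was *maximized* over the grid, so quoting it as the
# classifier's accuracy is exactly the step I am worried about.
print(search.best_params_, search.best_score_)
```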

Best Answer

If you learn the hyper-parameters on the full training data and then cross-validate, you will get an optimistically biased performance estimate, because the test data in each fold will already have been used in setting the hyper-parameters; the selected hyper-parameters suit the test data in part because they were chosen to. The optimistic bias introduced in this way can be unexpectedly large. See Cawley and Talbot, "On Over-fitting in Model Selection and Subsequent Selection Bias in Performance Evaluation", JMLR 11(Jul):2079-2107, 2010 (particularly section 5.3). The best thing to do is nested cross-validation. The basic idea is that you cross-validate the entire method used to generate the model: treat model selection (choosing the hyper-parameters) as simply part of the model-fitting procedure (where the parameters are determined), and you can't go too far wrong.
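As a concrete illustration (mine, not from the original answer), here is a minimal sketch of nested cross-validation using scikit-learn; the data and grid values are placeholders:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Inner loop: choose C and gamma by cross-validated accuracy.
inner = GridSearchCV(SVC(kernel="rbf"),
                     {"C": [0.1, 1, 10, 100],
                      "gamma": [1e-3, 1e-2, 1e-1, 1]},
                     cv=5)

# Outer loop: cross-validate the *entire* procedure, grid search
# included, so each outer test fold is never seen during model
# selection and the performance estimate is not optimistically biased.
outer_scores = cross_val_score(inner, X, y, cv=5)
print(outer_scores.mean(), outer_scores.std())
```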

If you use cross-validation on the training set to determine the hyper-parameters, train a model with those hyper-parameters on the whole training set, and then evaluate its performance on a separate test set, that is also fine (provided you have enough data to reliably fit the model and estimate its performance using disjoint partitions).
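A sketch of that protocol, under the same placeholder assumptions as above: tune on the training set only, refit with the chosen hyper-parameters, and report accuracy on the untouched test set:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Hyper-parameter search confined to the training set.
search = GridSearchCV(SVC(kernel="rbf"),
                      {"C": [0.1, 1, 10, 100],
                       "gamma": [1e-3, 1e-2, 1e-1, 1]},
                      cv=5)
search.fit(X_train, y_train)

# GridSearchCV refits on the whole training set with the best
# parameters, so scoring on the held-out test set gives an estimate
# untouched by the search.
print(search.score(X_test, y_test))
```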
