Solved – SVM parameter selection and model testing with cross-validation

Tags: cross-validation, svm

I've read related questions on this, but I still don't get it.

My problem is to construct a simple SVM, tune its parameters, and estimate the generalization error. Assume I have a dataset with, e.g., 10 features and 200 samples. I don't want to waste data, since the set is relatively small. My approach would be:

  1. Split the dataset (e.g., 70/30) with the holdout method into a training/validation set and a test set.

  2. Run repeated n-fold cross-validation on the training set. For each parameter setting, I calculate the error rate (misclassification rate) once the complete n-fold run is finished, simply by counting all misclassified samples. I try to minimize this error rate (or some other loss function?), store it, and choose the parameters with the lowest error rate. Then I retrain the model on the complete training set with the chosen parameters (see the sketch after this list).

  3. Estimate the generalization error on the test set.
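For concreteness, here is a rough sketch of steps 1–3 with scikit-learn. The data, the parameter grid, and the 5×5 repeated CV are placeholder assumptions for illustration, not something prescribed above.

```python
# Sketch of the workflow above (assumes scikit-learn; X and y are placeholders).
import numpy as np
from sklearn.model_selection import (train_test_split, GridSearchCV,
                                     RepeatedStratifiedKFold)
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))          # placeholder data: 200 samples, 10 features
y = rng.integers(0, 2, size=200)        # placeholder binary labels

# 1. Hold out a test set (70/30 split).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

# 2. Repeated n-fold CV on the training set to choose C and gamma.
param_grid = {"svc__C": [0.1, 1, 10, 100],
              "svc__gamma": ["scale", 0.01, 0.1, 1]}
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=5, random_state=0)
search = GridSearchCV(make_pipeline(StandardScaler(), SVC()),
                      param_grid, cv=cv, scoring="accuracy")
search.fit(X_train, y_train)            # refits on the whole training set by default

# 3. Estimate the generalization error on the untouched test set.
test_error = 1 - accuracy_score(y_test, search.predict(X_test))
print(search.best_params_, test_error)
```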

The problems:

  • The larger my test set is, the smaller my training set gets, so I discard potential information. Can this be solved with a "stacked" n-fold CV?
  • Do I really have to run a REPEATED n-fold CV? Are there other possibilities?
  • Is the error rate an appropriate loss function, or should I choose another one (e.g., the empirical error function or MSE, but then I'd need a probability output, right?)?

Best Answer

The larger my test set is, the smaller my training set gets, so I discard potential information. Can this be solved with a "stacked" n-fold CV?

Yes. This is usually called nested or double cross-validation, and we have a number of questions and answers about it. You could start, e.g., with Nested cross validation for model selection.
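A minimal sketch of nested (double) cross-validation with scikit-learn is below; the data and the parameter grid are illustrative assumptions. The inner loop tunes the hyperparameters, while the outer loop estimates the performance of the whole tuning procedure, so no data is permanently set aside as a fixed holdout.

```python
# Nested CV sketch (assumes scikit-learn; X, y and the grid are placeholders).
import numpy as np
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))          # placeholder data
y = rng.integers(0, 2, size=200)        # placeholder binary labels

inner_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=2)

# Inner loop: hyperparameter search.
tuned_svm = GridSearchCV(
    make_pipeline(StandardScaler(), SVC()),
    {"svc__C": [0.1, 1, 10, 100], "svc__gamma": ["scale", 0.01, 0.1, 1]},
    cv=inner_cv)

# Outer loop: each outer fold gets its own freshly tuned model; the outer
# scores estimate the generalization performance of the whole procedure.
outer_scores = cross_val_score(tuned_svm, X, y, cv=outer_cv)
print(outer_scores.mean(), outer_scores.std())
```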

Do I really have to run a REPEATED n-fold CV? Are there other possibilities?

Repetitions/iterations in resampling validation help only if the (surrogate) models are unstable. If you are really sure your models are stable (but how can you be when you have concerns about small sample size?), then you don't need the iterations/repetitions. OTOH, IMHO the easiest way to show that the models are stable is to run a few iterations and look at the stability of the predictions.
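As a rough illustration of that stability check (placeholder data and an assumed, fixed hyperparameter setting): run a few CV repetitions and count how often the out-of-fold prediction for a sample changes between repetitions.

```python
# Stability check sketch (assumes scikit-learn; data and C/gamma are placeholders).
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(140, 10))           # placeholder training data
y = rng.integers(0, 2, size=140)

model = make_pipeline(StandardScaler(), SVC(C=1, gamma="scale"))

# One column of out-of-fold predictions per CV repetition.
preds = np.column_stack([
    cross_val_predict(model, X, y,
                      cv=StratifiedKFold(5, shuffle=True, random_state=rep))
    for rep in range(10)])

# Fraction of samples whose prediction flips between repetitions:
# close to 0 suggests the surrogate models are stable.
flip_rate = np.mean(preds.min(axis=1) != preds.max(axis=1))
print(flip_rate)
```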

Is the error rate an appropriate loss function, or should I choose another one (e.g., the empirical error function or MSE, but then I'd need a probability output, right?)?

No, the overall error rate is not a very good loss function, particularly not for optimization. MSE is much better; it is a proper scoring rule. And yes, proper scoring rules need probability output.
However, SVMs are anyway quite ugly to optimize, as they do not react continuously to small changes in the training data and hyperparameters: up to a certain point nothing changes (i.e., the same cases stay support vectors), and then the set of support vectors changes suddenly.
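As an illustration, scikit-learn lets you plug a proper scoring rule directly into the tuning loop: the Brier score is exactly the MSE of the predicted probabilities, and SVC(probability=True) adds Platt scaling so the SVM produces probability output. The data and the grid below are placeholder assumptions.

```python
# Tuning with a proper scoring rule instead of error rate
# (assumes scikit-learn; data and grid are placeholders).
import numpy as np
from sklearn.model_selection import GridSearchCV, RepeatedStratifiedKFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(140, 10))           # placeholder training data
y = rng.integers(0, 2, size=140)

search = GridSearchCV(
    make_pipeline(StandardScaler(), SVC(probability=True)),  # Platt scaling
    {"svc__C": [0.1, 1, 10], "svc__gamma": ["scale", 0.1]},
    scoring="neg_brier_score",           # proper scoring rule; higher is better
    cv=RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=0))
search.fit(X, y)
print(search.best_params_, -search.best_score_)
```

Note that Platt scaling runs its own internal cross-validation, so this is noticeably slower than fitting a plain SVC.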
