Solved – Grid search for SVM parameters; is this really how it is done?

cross-validation, feature-selection, svm

Suppose I use nested 10-fold cross-validation with an SVM, so the inner-most loop will run 100 times. Now suppose I use a Gaussian radial basis function (RBF) kernel, which needs the scale parameter sigma, and that I also need to find the optimal C parameter for the SVM (it is called 'boxconstraint' in Matlab). To find the optimal (sigma, C) pair on, say, a 100 x 100 grid, I would need to train 100 * 100 = 10,000 SVMs for each inner fold. So in the end there would be about one million trained SVMs (performance estimation + parameter selection). Is this really how it is done?
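For concreteness, here is a minimal sketch of the nested set-up I mean, written with scikit-learn in Python rather than Matlab purely as an illustration (the data set, grid sizes and library calls are placeholder assumptions, not my actual code); the comment at the end shows where the fit counts come from:

```python
# Minimal sketch of nested cross-validation with a grid search over (C, gamma).
# Everything here (scikit-learn, the toy data, the 10 x 10 grid) is illustrative.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=20, random_state=0)

# 10 x 10 grid of candidate hyper-parameters (the question assumes 100 x 100).
param_grid = {"C": [2.0 ** k for k in range(-4, 6)],
              "gamma": [2.0 ** k for k in range(-7, 3)]}

# Inner loop: 10-fold grid search selects (C, gamma) on each outer training set.
inner = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=10)

# Outer loop: 10-fold estimate of the performance of the whole selection procedure.
scores = cross_val_score(inner, X, y, cv=10)
print("nested CV accuracy: %.3f +/- %.3f" % (scores.mean(), scores.std()))

# Number of SVM fits: outer folds * inner folds * grid points (plus refits),
# i.e. 10 * 10 * (10 * 10) = 10,000 here, or about 1,000,000 for a 100 x 100 grid.
```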

What if I add feature selection? I might then have to train and test 10 million SVMs. Should I apply for time on a supercomputer? What is the limit of an ordinary 8-core machine for this kind of selection?

Best Answer

Yes, SVM hyper-parameters are often tuned by grid search, and it is a pretty reasonable way to go about model selection provided you only have a few hyper-parameters to optimise. It isn't quite as expensive as the question suggests, because if you change the hyper-parameters slightly you don't have to start the training procedure (e.g. SMO) from scratch each time; instead you can start from the optimal solution for the previous set of hyper-parameters (a "warm start"). Also, the kernel matrix depends only on the RBF scale parameter and not on C, so if you lay out the grid so that each row represents a different value of C and each column a different value of the RBF scale parameter, and you then optimise down each column in turn, you can keep the cached evaluations of the kernel function and only need to flush the cache at the start of each new column. Alternatively, you can use an algorithm that evaluates the whole "regularisation path" (i.e. all values of C for a fixed kernel, a whole column of the grid) in one go.
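As a rough illustration of the kernel-caching point, here is a minimal sketch assuming scikit-learn (not my own code); it only shows the reuse of the Gram matrix across values of C via a precomputed kernel, and does not expose warm-starting of the SMO solver itself:

```python
# Sketch: the Gram matrix depends only on the RBF scale, so compute it once per
# gamma value (one "column" of the grid) and reuse it while sweeping C.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=20, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

best = (-np.inf, None)
for gamma in [2.0 ** k for k in range(-7, 3)]:       # new column: recompute the kernel
    K_tr = rbf_kernel(X_tr, X_tr, gamma=gamma)       # kernel evaluated once per gamma
    K_val = rbf_kernel(X_val, X_tr, gamma=gamma)
    for C in [2.0 ** k for k in range(-4, 6)]:       # sweep C with the kernel fixed
        acc = SVC(kernel="precomputed", C=C).fit(K_tr, y_tr).score(K_val, y_val)
        if acc > best[0]:
            best = (acc, (C, gamma))
print("best (C, gamma):", best[1], "validation accuracy:", best[0])
```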

Grid search becomes impractical for problems with a large number of hyper-parameters, so in that case I use the Nelder-Mead simplex algorithm, which can minimise the model-selection criterion without needing gradient information and works pretty well. Alternatively you can use gradient-based optimisation (and optimising a bound on the leave-one-out cross-validation error is also a good alternative to regular cross-validation).
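For example, a minimal sketch of Nelder-Mead tuning of (C, gamma) in log-space, assuming scipy and scikit-learn (the data set, starting point and cross-validation objective are illustrative choices, not a definitive recipe):

```python
# Sketch: tune (C, gamma) with the Nelder-Mead simplex method instead of a grid,
# searching in log-space and minimising the 10-fold cross-validation error.
import numpy as np
from scipy.optimize import minimize
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=20, random_state=0)

def cv_error(log_params):
    C, gamma = np.exp(log_params)                    # optimise in log-space
    model = SVC(kernel="rbf", C=C, gamma=gamma)
    return 1.0 - cross_val_score(model, X, y, cv=10).mean()

res = minimize(cv_error, x0=np.log([1.0, 0.1]), method="Nelder-Mead")
print("best (C, gamma):", np.exp(res.x), "CV error:", res.fun)
```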

I would generally recommend against feature selection for SVMs (unless identifying the relevant features is of intrinsic interest) as it often makes performance worse rather than better. The SVM is an approximate implementation of a bound on generalisation performance which is independent of dimensionality, so there is good reason to think they will work well without the need for feature selection (provided C is carefully tuned).

I do this kind of thing all the time, and it is computationally expensive. However, the key to using kernel methods correctly lies in rigorous tuning of the kernel and regularisation hyper-parameters, so the expense is well justified; it is simply what is required for "best practice".
