Solved – SVM Hyperparameters Tuning

cross-validation, hyperparameter, machine learning, optimization, svm

I am using an SVM classifier to classify data. My dataset consists of about 1 million samples.

Currently I am at the stage of tuning the model, trying to find the best parameters: a suitable kernel (and its kernel parameters), the regularization parameter (C), and the tolerance (epsilon).

My current approach is to use a black-box global optimization algorithm to find the best parameter set, with k-fold cross-validation error as the minimization objective. The optimization algorithms I have available are: CMA-ES, downhill simplex, hill climbing, GA, and simulated annealing.
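
To make the setup concrete, here is a minimal sketch of the kind of loop I am running (scikit-learn and SciPy stand in for my actual code; the RBF kernel, starting point, and synthetic data are just placeholders):

```python
import numpy as np
from scipy.optimize import minimize
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in for the real 1M-sample dataset
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

def cv_error(log_params):
    """k-fold cross-validation error of an RBF-SVM, searched in log-space."""
    C, gamma = 10.0 ** np.asarray(log_params)
    clf = SVC(C=C, gamma=gamma, kernel="rbf")
    return 1.0 - cross_val_score(clf, X, y, cv=5).mean()

# One of the black-box optimizers from my list (downhill simplex / Nelder-Mead)
result = minimize(cv_error, x0=[0.0, -2.0], method="Nelder-Mead",
                  options={"maxfev": 50})
best_C, best_gamma = 10.0 ** result.x
print(best_C, best_gamma)
```

Every call to cv_error fits k SVMs on the data, which is exactly what makes the whole search so slow on 1 million samples.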

The problem is that cross-validation is a very slow process, which makes all of these algorithms run for hours or even days, and as far as I know cross-validation is the only option I have.

I want to know whether it is possible to run the optimization on only a small part of the dataset to improve the runtime, or whether the parameters found that way will not work well on the full, larger dataset.

Also, are the kernel parameters correlated with the C parameter? Is it possible to tune C first and, once a good value is found, optimize the kernel parameters afterwards?

I know hyperparameter tuning is a very common problem, so why does it feel like there is no "clean" solution for it? There must be a way that works for large datasets.

I'll appreciate any kind of help and advice.

Best Answer

My experience with SVM does not include 1M-sample datasets; I usually work with datasets of up to 50K samples. So caveat emptor.

1) There is no way to decouple gamma from C. I answered your other question on that: Are the kernel parameters and the regularization parameter correlated in SVM?

2) There is no epsilon for classification; epsilon is a hyperparameter for SVM regression only.
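
For example, in scikit-learn's interface (used here only as an illustration), epsilon exists on the regressor but not on the classifier; the classifier's tol is a solver stopping tolerance, not the epsilon-insensitive loss:

```python
from sklearn.svm import SVC, SVR

# Classification: C and kernel parameters only; `tol` is just a stopping criterion
clf = SVC(C=1.0, kernel="rbf", gamma=0.1, tol=1e-3)

# Regression: epsilon sets the width of the epsilon-insensitive tube
reg = SVR(C=1.0, kernel="rbf", gamma=0.1, epsilon=0.1)
```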

3) On datasets of up to 50K samples I found that PSO works better than simplex and SA. I have not tried CMA-ES or GA. Downhill/hill-climbing methods do not really apply here: unless you use some approximation to the leave-one-out error, there is no closed-form expression for the gradient.
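
For reference, a bare-bones PSO over (log10 C, log10 gamma) looks roughly like this (a sketch assuming scikit-learn and an RBF kernel; the bounds, swarm size, and coefficients are illustrative, not tuned):

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

def pso_svm(X, y, n_particles=10, n_iters=20, seed=0):
    """Minimal PSO over (log10 C, log10 gamma); a sketch, not a tuned library."""
    rng = np.random.default_rng(seed)
    lo = np.array([-2.0, -5.0])            # log10 search box for (C, gamma)
    hi = np.array([4.0, 1.0])
    pos = rng.uniform(lo, hi, size=(n_particles, 2))
    vel = np.zeros_like(pos)

    def cv_error(p):
        C, gamma = 10.0 ** p
        clf = SVC(C=C, gamma=gamma, kernel="rbf")
        return 1.0 - cross_val_score(clf, X, y, cv=2).mean()

    pbest = pos.copy()
    pbest_err = np.array([cv_error(p) for p in pos])
    gbest = pbest[pbest_err.argmin()].copy()

    w, c1, c2 = 0.7, 1.5, 1.5              # standard inertia/attraction weights
    for _ in range(n_iters):
        r1, r2 = rng.random((2, n_particles, 2))
        vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
        pos = np.clip(pos + vel, lo, hi)
        err = np.array([cv_error(p) for p in pos])
        improved = err < pbest_err
        pbest[improved] = pos[improved]
        pbest_err[improved] = err[improved]
        gbest = pbest[pbest_err.argmin()].copy()
    return 10.0 ** gbest                    # best (C, gamma) found

# Usage with stand-in data:
rng = np.random.default_rng(1)
X = rng.normal(size=(400, 5))
y = (X[:, 0] > 0).astype(int)
C_best, gamma_best = pso_svm(X, y)
```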

4) You don't need to use a low-variance CV; I found that 2-fold is good enough. If you cannot afford the computational time of 2-fold, then you could tune on a subsample, but in my experience that yields worse results.
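
Concretely, the two options look like this in scikit-learn (illustrative code; the 10% subsample size is arbitrary):

```python
import numpy as np
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.svm import SVC

# Stand-in data; replace with the real dataset
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 10))
y = (X[:, 0] > 0).astype(int)

clf = SVC(C=1.0, gamma=0.1)

# 2-fold CV on the full data: each fold trains on 50% of the samples
err_2fold = 1.0 - cross_val_score(clf, X, y, cv=2).mean()

# Cheaper still: 2-fold CV on a stratified 10% subsample (tends to be worse)
X_sub, _, y_sub, _ = train_test_split(X, y, train_size=0.1,
                                      stratify=y, random_state=0)
err_sub = 1.0 - cross_val_score(clf, X_sub, y_sub, cv=2).mean()
```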

5) IN MY EXPERIENCE, the error surface over the hyperparameters is somewhat smooth - not convex, but smooth - with no deep and narrow regions of low error that are worth spending a lot of computational time searching for. Provided you are not selecting hyperparameters in a bad region of the error surface, there is no point in probing the surface too finely. On that note, I would suggest a 5x5 grid search followed by another 5x5 grid around the minimum of the first one - this should be enough. You end up probing the error surface 50 times, and you will probably get a result as good as any black-box optimization with the same limit on the number of probes. And it is much easier to parallelize.
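
A minimal sketch of that two-stage grid in scikit-learn (the ranges are illustrative; n_jobs=-1 gives the easy parallelization, and cv=2 keeps with point 4):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Stand-in data; replace with the real dataset
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 10))
y = (X[:, 0] > 0).astype(int)

# Stage 1: coarse 5x5 grid in log-space (25 probes), parallel over grid points
coarse = {"C": np.logspace(-2, 4, 5), "gamma": np.logspace(-5, 1, 5)}
gs1 = GridSearchCV(SVC(), coarse, cv=2, n_jobs=-1).fit(X, y)

# Stage 2: finer 5x5 grid spanning one coarse step around the stage-1 minimum
bC = np.log10(gs1.best_params_["C"])
bg = np.log10(gs1.best_params_["gamma"])
fine = {"C": np.logspace(bC - 1.5, bC + 1.5, 5),
        "gamma": np.logspace(bg - 1.5, bg + 1.5, 5)}
gs2 = GridSearchCV(SVC(), fine, cv=2, n_jobs=-1).fit(X, y)
print(gs2.best_params_)   # 50 probes of the error surface in total
```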
