Solved – SVM Hyperparameters Tuning

cross-validation, hyperparameter, machine learning, optimization, svm

I am using an SVM classifier to classify data. My dataset consists of about 1 million samples.

Currently I am at the stage of tuning the model, trying to find the best parameters: a suitable kernel (and its kernel parameters), the regularization parameter (C), and the tolerance (epsilon).

My current approach is to use a black-box global optimization algorithm to find the best parameter set, with k-fold cross-validation error as the minimization objective. The optimization algorithms I have available are: CMA-ES, downhill simplex, hill climbing, GA, and simulated annealing.
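
To make the setup concrete, here is a minimal sketch of the kind of loop I am running (scikit-learn and SciPy stand in for my actual code; the RBF kernel, starting point, and synthetic data are just placeholders):

```python
import numpy as np
from scipy.optimize import minimize
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in for the real 1M-sample dataset
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

def cv_error(log_params):
    """k-fold cross-validation error of an RBF-SVM, searched in log-space."""
    C, gamma = 10.0 ** np.asarray(log_params)
    clf = SVC(C=C, gamma=gamma, kernel="rbf")
    return 1.0 - cross_val_score(clf, X, y, cv=5).mean()

# One of the black-box optimizers from my list (downhill simplex / Nelder-Mead)
result = minimize(cv_error, x0=[0.0, -2.0], method="Nelder-Mead",
                  options={"maxfev": 50})
best_C, best_gamma = 10.0 ** result.x
print(best_C, best_gamma)
```

Every call to cv_error fits k SVMs on the data, which is exactly what makes the whole search so slow on 1 million samples.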

The problem is that cross-validation is a very slow process, which makes all of these algorithms run for hours or even days, and as far as I know cross-validation is the only option I have.

I want to know whether it is possible to run the optimization on only a small part of the dataset to improve the runtime, or whether the parameters found that way will not work well on the full, larger dataset.

Also, are the kernel parameters correlated with the C parameter? Is it possible to tune C first and, once a good value is found, optimize the kernel parameters afterwards?

I know hyperparameter tuning is a very common problem, so why does it feel like there is no "clean" solution for it? There must be a way that works for large datasets.

I'll appreciate any kind of help and advice.

Best Answer

My experience with SVM does not include 1M-sample datasets; I usually work with datasets of up to 50K samples. So caveat emptor.

1) There is no way to decouple gamma from C. I answered your other question on that: Are the kernel parameters and the regularization parameter correlated in SVM?

2) There is no epsilon for classification; epsilon is a hyperparameter for SVM regression only.
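
For example, in scikit-learn's interface (used here only as an illustration), epsilon exists on the regressor but not on the classifier; the classifier's tol is a solver stopping tolerance, not the epsilon-insensitive loss:

```python
from sklearn.svm import SVC, SVR

# Classification: C and kernel parameters only; `tol` is just a stopping criterion
clf = SVC(C=1.0, kernel="rbf", gamma=0.1, tol=1e-3)

# Regression: epsilon sets the width of the epsilon-insensitive tube
reg = SVR(C=1.0, kernel="rbf", gamma=0.1, epsilon=0.1)
```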

3) On datasets of up to 50K samples I found that PSO works better than simplex and SA. I have not tried CMA-ES or GA. Downhill/hill-climbing methods do not really apply here: unless you use some approximation to the leave-one-out error, there is no closed-form expression for the gradient.
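
For reference, a bare-bones PSO over (log10 C, log10 gamma) looks roughly like this (a sketch assuming scikit-learn and an RBF kernel; the bounds, swarm size, and coefficients are illustrative, not tuned):

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

def pso_svm(X, y, n_particles=10, n_iters=20, seed=0):
    """Minimal PSO over (log10 C, log10 gamma); a sketch, not a tuned library."""
    rng = np.random.default_rng(seed)
    lo = np.array([-2.0, -5.0])            # log10 search box for (C, gamma)
    hi = np.array([4.0, 1.0])
    pos = rng.uniform(lo, hi, size=(n_particles, 2))
    vel = np.zeros_like(pos)

    def cv_error(p):
        C, gamma = 10.0 ** p
        clf = SVC(C=C, gamma=gamma, kernel="rbf")
        return 1.0 - cross_val_score(clf, X, y, cv=2).mean()

    pbest = pos.copy()
    pbest_err = np.array([cv_error(p) for p in pos])
    gbest = pbest[pbest_err.argmin()].copy()

    w, c1, c2 = 0.7, 1.5, 1.5              # standard inertia/attraction weights
    for _ in range(n_iters):
        r1, r2 = rng.random((2, n_particles, 2))
        vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
        pos = np.clip(pos + vel, lo, hi)
        err = np.array([cv_error(p) for p in pos])
        improved = err < pbest_err
        pbest[improved] = pos[improved]
        pbest_err[improved] = err[improved]
        gbest = pbest[pbest_err.argmin()].copy()
    return 10.0 ** gbest                    # best (C, gamma) found

# Usage with stand-in data:
rng = np.random.default_rng(1)
X = rng.normal(size=(400, 5))
y = (X[:, 0] > 0).astype(int)
C_best, gamma_best = pso_svm(X, y)
```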

4) You don't need to use a low-variance CV; I found that 2-fold is good enough. If you cannot afford the computational time of 2-fold, then you could tune on a subsample, but in my experience that yields worse results.
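
Concretely, the two options look like this in scikit-learn (illustrative code; the 10% subsample size is arbitrary):

```python
import numpy as np
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.svm import SVC

# Stand-in data; replace with the real dataset
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 10))
y = (X[:, 0] > 0).astype(int)

clf = SVC(C=1.0, gamma=0.1)

# 2-fold CV on the full data: each fold trains on 50% of the samples
err_2fold = 1.0 - cross_val_score(clf, X, y, cv=2).mean()

# Cheaper still: 2-fold CV on a stratified 10% subsample (tends to be worse)
X_sub, _, y_sub, _ = train_test_split(X, y, train_size=0.1,
                                      stratify=y, random_state=0)
err_sub = 1.0 - cross_val_score(clf, X_sub, y_sub, cv=2).mean()
```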

5) IN MY EXPERIENCE, the error surface over the hyperparameters is somewhat smooth - not convex, but smooth - with no deep and narrow regions of low error that are worth spending a lot of computational time searching for. Provided you are not selecting hyperparameters in a bad region of the error surface, there is no point in probing the surface too finely. On that note, I would suggest a 5x5 grid search followed by another 5x5 grid around the minimum of the first one - this should be enough. You end up probing the error surface 50 times, and you will probably get a result as good as any black-box optimization with the same limit on the number of probes. And it is much easier to parallelize.
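
A minimal sketch of that two-stage grid in scikit-learn (the ranges are illustrative; n_jobs=-1 gives the easy parallelization, and cv=2 keeps with point 4):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Stand-in data; replace with the real dataset
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 10))
y = (X[:, 0] > 0).astype(int)

# Stage 1: coarse 5x5 grid in log-space (25 probes), parallel over grid points
coarse = {"C": np.logspace(-2, 4, 5), "gamma": np.logspace(-5, 1, 5)}
gs1 = GridSearchCV(SVC(), coarse, cv=2, n_jobs=-1).fit(X, y)

# Stage 2: finer 5x5 grid spanning one coarse step around the stage-1 minimum
bC = np.log10(gs1.best_params_["C"])
bg = np.log10(gs1.best_params_["gamma"])
fine = {"C": np.logspace(bC - 1.5, bC + 1.5, 5),
        "gamma": np.logspace(bg - 1.5, bg + 1.5, 5)}
gs2 = GridSearchCV(SVC(), fine, cv=2, n_jobs=-1).fit(X, y)
print(gs2.best_params_)   # 50 probes of the error surface in total
```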
