SVM – Should a Grid Search Show High-Accuracy Region with Surrounding Low Accuracies?

svm

I have 12 positive training sets (cancer cells treated with drugs with each of 12 different mechanisms of action). For each of these positive training sets, I would like to train a support-vector machine to distinguish it from a negative set of equal size sampled from the same experiment. Each set has between 1000 and 6000 cells, and each cell is described by 476 image features, each scaled linearly to [0, 1].

I use LIBSVM with the Gaussian RBF kernel. Using five-fold cross-validation, I have run a grid search over log₂ C ∈ [-5, 15] and log₂ γ ∈ [-15, 3]. The results are as follows:

[Figure: cross-validation accuracy over the (log₂ C, log₂ γ) grid for each of the 12 classification problems]

I was disappointed that there is not a single set of parameters that gives high accuracies for all 12 classification problems. I was also surprised that the grids do not generally show a high-accuracy region surrounded by lower accuracies. Does this just mean that I need to expand the parameter search space, or is the grid search an indication that something else is wrong?
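For concreteness, here is a minimal sketch of the grid search described above in Python, using scikit-learn's LIBSVM-backed `SVC` and `GridSearchCV` rather than LIBSVM directly; the arrays `X` and `y` are random placeholders standing in for one positive set and its sampled negative set:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import MinMaxScaler

# Placeholder data: X is (n_cells, 476) image features, y is 1 for the
# treated (positive) cells and 0 for the sampled negative cells.
rng = np.random.default_rng(0)
X = rng.random((1000, 476))
y = rng.integers(0, 2, 1000)

# Scale each feature linearly to [0, 1], as in the question.
X = MinMaxScaler().fit_transform(X)

# Grid over log2(C) in [-5, 15] and log2(gamma) in [-15, 3].
param_grid = {
    "C": 2.0 ** np.arange(-5, 16, 2),
    "gamma": 2.0 ** np.arange(-15, 4, 2),
}

search = GridSearchCV(
    SVC(kernel="rbf"),   # Gaussian RBF kernel, LIBSVM-backed
    param_grid,
    cv=5,                # five-fold cross-validation
    n_jobs=-1,
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```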

Best Answer

The optimal values of the hyper-parameters will differ between learning tasks; you need to tune them separately for every problem.
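A hedged sketch of what tuning separately per problem could look like, assuming a hypothetical `problems` dictionary holding the 12 positive-vs-negative tasks (random placeholder data here); the point is simply that each task gets its own best (C, γ) pair:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(1)
# Hypothetical stand-in for the 12 positive-vs-negative problems:
# {mechanism name: (features scaled to [0, 1], labels)}.
problems = {f"moa_{i}": (rng.random((200, 476)), rng.integers(0, 2, 200))
            for i in range(12)}

# Coarser grid than in the question, just to keep the sketch quick.
param_grid = {"C": 2.0 ** np.arange(-5, 16, 4),
              "gamma": 2.0 ** np.arange(-15, 4, 4)}

best_params = {}
for name, (X, y) in problems.items():
    search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5, n_jobs=-1)
    search.fit(X, y)
    best_params[name] = search.best_params_   # a separate optimum per task
print(best_params)
```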

The reason you don't get a single optimum is because both the kernel parameter and the regularisation parameter control the complexity of the model. If C is small, you get a smooth model; likewise, if the kernel width is large (i.e. γ is small), you get a smooth model, as the basis functions are not very local. This means that different combinations of C and the kernel width lead to similarly complex models with similar performance, which is why you see the diagonal feature in many of your plots.
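This trade-off can be illustrated on synthetic data (not the cell data): moving along a diagonal where log₂ C increases while log₂ γ decreases often yields similar cross-validation accuracies, matching the diagonal ridges in the plots above.

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

# Synthetic illustration only.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Walk along a diagonal in (log2 C, log2 gamma) space: less regularisation
# (larger C) paired with a broader kernel (smaller gamma).
for log2_c, log2_gamma in [(-1, 1), (3, -3), (7, -7), (11, -11)]:
    clf = SVC(kernel="rbf", C=2.0 ** log2_c, gamma=2.0 ** log2_gamma)
    scores = cross_val_score(clf, X, y, cv=5)
    # Accuracies along such a diagonal are often similar, because the two
    # parameters trade off against each other in controlling complexity.
    print(f"log2 C = {log2_c:>3}, log2 gamma = {log2_gamma:>3}: "
          f"CV accuracy = {scores.mean():.3f}")
```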

The optimum also depends on the particular sampling of the training set. It is possible to over-fit the cross-validation error, so choosing the hyper-parameters by cross-validation can actually make performance worse if you are unlucky. See Cawley and Talbot (2010, JMLR) for some discussion of this.
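One standard guard against this, if you also want a performance estimate that is not biased by the hyper-parameter selection, is nested cross-validation: the hyper-parameters are chosen in an inner loop while an outer loop estimates generalisation performance. A minimal sketch with scikit-learn, on synthetic placeholder data:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV, cross_val_score

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

param_grid = {"C": 2.0 ** np.arange(-5, 16, 2),
              "gamma": 2.0 ** np.arange(-15, 4, 2)}

# Inner loop: 5-fold grid search selects C and gamma.
inner = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5, n_jobs=-1)

# Outer loop: 5-fold CV gives a performance estimate that is not biased by
# the hyper-parameter selection, exposing over-fitting of the inner CV.
outer_scores = cross_val_score(inner, X, y, cv=5)
print(f"Nested CV accuracy: {outer_scores.mean():.3f} +/- {outer_scores.std():.3f}")
```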

The fact that there is a broad plateau of hyper-parameter values giving similarly good performance is actually a good feature of support vector machines, as it suggests they are not overly vulnerable to over-fitting in model selection. If there were a sharp peak at the optimal values, that would be a bad sign: with a finite dataset, the cross-validation estimate gives only an unreliable indication of where that peak actually lies, so it would be difficult to find.
