Solved – SVM: Number of support vectors

machine-learning, svm

Imagine I am using an SVM to train a classifier for a given dataset in a one-vs-all configuration. For each class, I perform cross-validation for parameter selection (a grid search over both the SVM and kernel parameters). I am also using the RBF kernel.

In many cases, I observe that the number of support vectors for the positive class exactly matches (or is very close to) the total number of positive examples. The number of support vectors for the negative class varies, but is still quite high.
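A minimal sketch of this setup (assuming scikit-learn; the dataset and parameter grid here are just placeholders):

```python
# Sketch of the setup described above: for each class, tune an RBF SVM by
# grid search on one-vs-all labels, then inspect per-class support vector
# counts (scikit-learn assumed; dataset and grid are placeholders).
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_classes=3, n_informative=6,
                           random_state=0)
param_grid = {"C": [0.1, 1, 10, 100], "gamma": [1e-3, 1e-2, 1e-1, 1]}

for cls in sorted(set(y)):
    y_bin = (y == cls).astype(int)                        # one-vs-all labels
    search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5).fit(X, y_bin)
    n_neg, n_pos = search.best_estimator_.n_support_      # SVs for class 0, 1
    print(f"class {cls}: {n_pos} positive SVs / {y_bin.sum()} positives, "
          f"{n_neg} negative SVs")
```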

Thus, my questions are:

1) For the linear case, is it reasonable to expect the number of support vectors from the positive set to be significantly lower (given that the data is apparently not linearly separable)?

2) Does the number of support vectors selected reflect classification "capacity" in the sense that a high capacity classifier can separate highly overlapping classes with a complex decision boundary? Is it reasonable to assume that a higher number of support vectors is indicative of high capacity?

Best Answer

The proportion of support vectors is an upper bound on the leave-one-out cross-validation error (as the decision boundary is unaffected if you leave out a non-support vector), and thus provides an indication of the generalisation performance of the classifier. However, the bound isn't necessarily very tight (or even usefully tight), so you can have a model with lots of support vectors, but a low leave-one-out error (which appears to be the case here). There are tighter (approximate) bounds, such as the Span bound, which are more useful.
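To make the bound concrete: if a fraction p of the training examples are support vectors, the expected leave-one-out error is at most p. A quick sketch to compare both quantities (scikit-learn assumed; the dataset is a placeholder):

```python
# Compare the support-vector fraction (an upper bound on the LOO error)
# with the actual leave-one-out error (scikit-learn assumed; toy dataset).
from sklearn.datasets import make_classification
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, random_state=0)
clf = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X, y)

sv_fraction = clf.n_support_.sum() / len(X)               # the (loose) bound
loo_error = 1 - cross_val_score(clf, X, y, cv=LeaveOneOut()).mean()
print(f"SV fraction (bound): {sv_fraction:.3f}  actual LOO error: {loo_error:.3f}")
```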

This commonly happens when you tune the hyper-parameters to optimise the CV error: you can end up with a broad (very smooth) kernel and a small value of C (so margin violations are not penalised very much), in which case the margin becomes very wide and there are lots of support vectors. Essentially, both the kernel and regularisation parameters control capacity, and you can get a diagonal trough in the CV error as a function of the hyper-parameters, because their effects are correlated and different combinations of kernel parameter and regularisation provide similarly good models.
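You can see this directly by mapping the CV error and support-vector count over a grid of (C, gamma) values; the count grows as C shrinks and the kernel broadens. A sketch (scikit-learn assumed, toy data):

```python
# Map CV error and support-vector count over a (C, gamma) grid to see the
# correlated effect of the two hyper-parameters (scikit-learn assumed).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, random_state=0)
for C in np.logspace(-2, 3, 6):
    for gamma in np.logspace(-4, 1, 6):
        clf = SVC(kernel="rbf", C=C, gamma=gamma)
        err = 1 - cross_val_score(clf, X, y, cv=5).mean()
        n_sv = clf.fit(X, y).n_support_.sum()
        print(f"C={C:8.2f}  gamma={gamma:8.4f}  CV error={err:.3f}  #SV={n_sv}")
```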

It is worth noting that as soon as you tune the hyper-parameters, e.g. via CV, the SVM no longer implements a structural risk minimisation approach: we are tuning the hyper-parameters directly on the data, with no capacity control over the hyper-parameters themselves. Essentially, the performance estimates or bounds are biased, or invalidated, by being directly optimised.

My advice would be not to worry about it and just be guided by the CV error (but remember that if you use CV to tune the model, you need nested CV to evaluate its performance). The sparsity of the SVM is a bonus, but I have found it rarely generates enough sparsity to be really worthwhile (L1 regularisation provides greater sparsity). For small problems (e.g. 400 patterns) I use the LS-SVM, which is fully dense and generally performs similarly well.
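For completeness, a sketch of the nested CV I mean (scikit-learn assumed): the inner grid search tunes (C, gamma), and the outer loop evaluates the whole tuning-plus-fitting procedure, so the selection bias mentioned above does not leak into the performance estimate.

```python
# Nested CV sketch (scikit-learn assumed): the inner GridSearchCV performs
# model selection; the outer cross_val_score assesses the whole procedure.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, random_state=0)
param_grid = {"C": [0.1, 1, 10, 100], "gamma": [1e-3, 1e-2, 1e-1, 1]}

inner = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)   # model selection
outer = cross_val_score(inner, X, y, cv=5)                  # model assessment
print(f"nested CV accuracy: {outer.mean():.3f} +/- {outer.std():.3f}")
```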