Solved – The relationship between the number of support vectors and the number of features

Tags: classification, kernel trick, machine learning, svm

I ran an SVM on a given data set and made the following observation: if I change the number of features used to build the classifier, the number of resulting support vectors also changes.

I would like to know how to explain this kind of scenario.

Best Answer

If you look at the optimization problem that SVM solves:

$\min_{\mathbf{w},\mathbf{\xi}, b } \left\{\frac{1}{2} \|\mathbf{w}\|^2 + C \sum_{i=1}^n \xi_i \right\}$

s.t. $y_i(\mathbf{w}\cdot\mathbf{x_i} - b) \ge 1 - \xi_i, ~~~~\xi_i \ge 0,$ for all $i=1,\dots,n$

the support vectors are those $\mathbf{x_i}$ whose margin constraint is active, i.e. $y_i(\mathbf{w}\cdot\mathbf{x_i} - b) \le 1$ (equivalently, those with a nonzero dual coefficient $\alpha_i$). In other words, they are the data points that are misclassified, inside the margin, or exactly on the margin: the points close to, or on the wrong side of, the decision boundary.
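To make this concrete, here is a minimal sketch (assuming scikit-learn and NumPy, with arbitrary synthetic data rather than anything from the question) that fits a soft-margin SVM and reads off which training points ended up as support vectors:

```python
# Minimal sketch: fit a soft-margin linear SVM on two synthetic clusters and
# inspect the support vectors (the points with nonzero dual coefficients).
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 0.7, size=(50, 2)),   # negative class around (-1, -1)
               rng.normal(+1, 0.7, size=(50, 2))])  # positive class around (+1, +1)
y = np.array([-1] * 50 + [1] * 50)

clf = SVC(kernel="linear", C=1.0).fit(X, y)

print("number of support vectors:", clf.support_vectors_.shape[0])
print("support vectors per class:", clf.n_support_)
```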

Now let's compare the solution of this problem when you have the full set of features with the case where you throw some features away. Throwing a feature away is functionally equivalent to keeping the feature but adding the constraint $w_j=0$ for each feature $j$ that we want to discard.

When you compare these two optimization problems, and work through the math, it turns out there is no hard relationship between the number of features and the number of support vectors. It could go either way.

It's useful to think about a simple case. Imagine a two-dimensional problem where your negative and positive examples are clustered around $(-1,-1)$ and $(1,1)$, respectively, and are separable by a diagonal separating hyperplane with 3 support vectors. Now imagine dropping the y-axis feature, so your data is now projected onto the x-axis. If the data are still separable, say at $x=0$, you would probably be left with only 2 support vectors, one on each side, so adding the y-feature back would increase the number of support vectors. However, if the data are no longer separable, you would get at least one support vector for each point that lands on the wrong side of $x=0$, in which case adding the y-feature would reduce the number of support vectors. A code sketch of this experiment follows below.
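Here is a rough sketch of that toy experiment, again assuming scikit-learn; the two cluster-spread values are arbitrary choices meant to produce one separable and one overlapping case:

```python
# Toy experiment: count support vectors for a linear SVM trained on the full
# 2-D data versus only the x-axis projection, for a separable and an
# overlapping pair of clusters.
import numpy as np
from sklearn.svm import SVC

def n_support_vectors(X, y):
    return SVC(kernel="linear", C=1.0).fit(X, y).support_vectors_.shape[0]

rng = np.random.default_rng(1)
for spread in (0.3, 1.2):  # small spread: separable; large spread: overlapping
    X = np.vstack([rng.normal(-1, spread, size=(50, 2)),
                   rng.normal(+1, spread, size=(50, 2))])
    y = np.array([-1] * 50 + [1] * 50)
    print(f"spread={spread}: 2-D SVs={n_support_vectors(X, y)}, "
          f"x-only SVs={n_support_vectors(X[:, :1], y)}")
```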

So, if this intuition is correct, then when you're working in a very high-dimensional feature space, or using a kernel that maps into a high-dimensional feature space, your data are more likely to be separable, and adding a feature will tend to just add another support vector. Whereas if your data are not currently separable and you add a feature that significantly improves separability, then you're more likely to see a decrease in the number of support vectors.
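As a hedged illustration of the second half of that claim, the sketch below (scikit-learn assumed, using its `make_circles` toy data) compares a linear SVM, for which concentric circles are not separable, with an RBF-kernel SVM, whose implicit high-dimensional feature map makes them separable; the kernel version typically needs far fewer support vectors:

```python
# Compare support-vector counts on data that is not linearly separable:
# a linear SVM versus an RBF-kernel SVM on concentric circles.
from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

for kernel in ("linear", "rbf"):
    clf = SVC(kernel=kernel, C=1.0).fit(X, y)
    print(kernel, "support vectors:", clf.support_vectors_.shape[0])
```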
