For a general kernel it is difficult to interpret the SVM weights; for the linear SVM, however, there is a useful interpretation:
1) Recall that in a linear SVM, the result is a hyperplane that separates the classes as well as possible. The weights represent this hyperplane: they are the coordinates of a vector orthogonal to the hyperplane, and they are exactly the coefficients given by svm.coef_. Let's call this vector w.
2) What can we do with this vector? Its direction gives us the predicted class: take the dot product of any point with w and add the intercept b (svm.intercept_), and the sign tells you which side of the hyperplane the point lies on. If w·x + b is positive, the point belongs to the positive class; if it is negative, to the negative class.
3) Finally, you can even learn something about the importance of each feature. This is my own interpretation, so convince yourself first. Say the SVM found only one feature useful for separating the data; then the hyperplane would be orthogonal to that axis. So you could say that the absolute size of a coefficient relative to the others indicates how important that feature was for the separation. For example, if only the first coordinate is used for separation, w will be of the form (x, 0), where x is some nonzero number: |x| > 0 while the second coefficient vanishes.
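To make (1)-(3) concrete, here is a minimal sketch using sklearn's SVC with a linear kernel; the toy data and all variable names are made up for illustration:

    import numpy as np
    from sklearn.svm import SVC

    # Toy data in which only the first feature separates the classes.
    rng = np.random.RandomState(0)
    X = rng.randn(200, 3)
    y = (X[:, 0] > 0).astype(int)

    clf = SVC(kernel="linear").fit(X, y)
    w = clf.coef_[0]            # vector orthogonal to the separating hyperplane
    b = clf.intercept_[0]

    # (2) The sign of w . x + b says which side of the hyperplane x is on;
    # it is the same quantity clf.decision_function computes.
    x_new = np.array([1.5, -0.3, 0.7])
    print(np.sign(w @ x_new + b))              # +1 -> positive class
    print(clf.decision_function(x_new[None]))  # same value, via sklearn

    # (3) Relative |w_j| as a rough feature-importance indicator:
    # w should look like (x, ~0, ~0) here, since only feature 0 matters.
    print(np.round(w, 3))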
From Section 7.10.2 of The Elements of Statistical Learning (free online, and it's great):
Consider a classification problem with a large number of predictors, as may arise, for example, in genomic or proteomic applications. A typical strategy for analysis might be as follows:
- Screen the predictors: find a subset of “good” predictors that show fairly strong (univariate) correlation with the class labels.
- Using just this subset of predictors, build a multivariate classifier.
- Use cross-validation to estimate the unknown tuning parameters and to estimate the prediction error of the final model.
Is this a correct application of cross-validation? Consider a scenario with N = 50 samples in two equal-sized classes, and p = 5000 quantitative predictors (standard Gaussian) that are independent of the class labels. The true (test) error rate of any classifier is 50%. We carried out the above recipe, choosing in step (1) the 100 predictors having highest correlation with the class labels, and then using a 1-nearest neighbor classifier, based on just these 100 predictors, in step (2). Over 50 simulations from this setting, the average CV error rate was 3%. This is far lower than the true error rate of 50%.
What has happened? The problem is that the predictors have an unfair advantage, as they were chosen in step (1) on the basis of all of the samples. Leaving samples out after the variables have been selected does not correctly mimic the application of the classifier to a completely independent test set, since these predictors “have already seen” the left out samples.
We selected the 100 predictors having largest correlation with the class labels over all 50 samples. Then we chose a random set of 10 samples, as we would do in five-fold cross-validation, and computed the correlations of the pre-selected 100 predictors with the class labels over just these 10 samples (shown in a figure in the book). We see that the correlations average about 0.28, rather than 0, as one might expect.
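To see the effect for yourself, here is a minimal sketch of that simulation in Python with sklearn; the use of f_classif as the in-fold screening statistic, and all the specific numbers, are illustrative stand-ins for the book's exact setup:

    import numpy as np
    from sklearn.feature_selection import SelectKBest, f_classif
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.pipeline import make_pipeline
    from sklearn.model_selection import cross_val_score

    rng = np.random.RandomState(0)
    X = rng.randn(50, 5000)          # predictors independent of the labels
    y = np.repeat([0, 1], 25)        # two equal-sized classes

    # Screen on ALL samples: the 100 predictors most correlated with y.
    yc = (y - y.mean()) / y.std()
    corr = (X - X.mean(0)).T @ yc / (X.std(0) * len(y))
    top = np.argsort(corr)[-100:]

    knn = KNeighborsClassifier(n_neighbors=1)

    # Wrong: the screen has already seen every sample, so CV is optimistic.
    wrong = cross_val_score(knn, X[:, top], y, cv=5).mean()

    # Right: redo the screening inside each training fold via a Pipeline.
    pipe = make_pipeline(SelectKBest(f_classif, k=100), knn)
    right = cross_val_score(pipe, X, y, cv=5).mean()

    # The pre-selected predictors also stay correlated with y on a random
    # 10-sample subset, echoing the ~0.28 average quoted above.
    sub = np.concatenate([rng.choice(25, 5, replace=False),
                          25 + rng.choice(25, 5, replace=False)])
    ysc = (y[sub] - y[sub].mean()) / y[sub].std()
    Xs = X[sub][:, top]
    sub_corr = (Xs - Xs.mean(0)).T @ ysc / (Xs.std(0) * len(sub))

    print(f"CV accuracy, screen outside CV: {wrong:.2f}")  # far above 0.5
    print(f"CV accuracy, screen inside CV:  {right:.2f}")  # near chance
    print(f"mean held-out correlation:      {sub_corr.mean():.2f}")

The fix is simply that feature selection is a step of the model and must live inside the cross-validation loop, which is what the Pipeline version does.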
Best Answer
I realize this is a super old question, but I ran into this same thing today and found this document. Section 7.3, which describes shrinking as implemented in libSVM (around which sklearn's SVM is a wrapper), gives a useful explanation.
So basically libSVM optimizes over a working subset of the Lagrange multipliers α_i at a time, temporarily dropping the ones that look pinned at their bounds. Since many of the α_i are typically zero in a given problem, this is often a safe heuristic to adopt. Much more information can be found in the linked document.
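If you are using sklearn, this heuristic is exposed as the shrinking flag on SVC (on by default); a minimal sketch of toggling it, with the dataset and sizes chosen arbitrarily:

    from sklearn.datasets import make_classification
    from sklearn.svm import SVC

    X, y = make_classification(n_samples=5000, n_features=20, random_state=0)

    # shrinking=True (the default) lets libSVM temporarily ignore the alpha_i
    # that appear stuck at their bounds, so each iteration works on a smaller
    # problem; shrinking=False solves the full problem throughout.
    svc_on = SVC(kernel="rbf", shrinking=True).fit(X, y)
    svc_off = SVC(kernel="rbf", shrinking=False).fit(X, y)

Both fits reach the same solution; shrinking only changes how much work the optimizer does along the way.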