Solved – SVM – does a number of dimensions greater than the number of samples give good or bad performance?

scikit-learn, svm

I was reading the sklearn SVM documentation and came across these two statements:

  • Still effective in cases where number of dimensions is greater than the number of samples
  • If the number of features is much greater than the number of samples, the method is likely to give poor performances.

My understanding of SVMs is fairly limited (which is why I am reading the documentation in the first place), but the two statements above seem contradictory. I assume I am missing something. What gap in my understanding makes these two statements not contradictory?

Best Answer

SVMs, like many other linear models, are based on empirical risk minimization, which leads to an optimization problem of the form:

$$\min_w\sum_i\ell(x_i, y_i, w)+\lambda\cdot r(w)$$

where $\ell$ is a loss function (the hinge loss in SVMs) and $r$ is a regularization function.
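
For concreteness (assuming labels $y_i \in \{-1, +1\}$ and omitting the bias term), the hinge loss can be written as

$$\ell(x_i, y_i, w)=\max\left(0,\ 1-y_i\,w^\top x_i\right)$$

which is zero whenever a sample is classified correctly with a margin of at least one.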

The SVM is a squared $\ell_2$-regularized linear model, i.e. $r(w) = \|w\|^2_2$. This guards against huge coefficients, as one would say in regression terms, since the coefficient magnitudes themselves are penalized in the optimization.
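
As a minimal sketch of that objective (not the solver sklearn actually uses; the toy data and the function name `svm_objective` are made up for illustration), the squared-$\ell_2$-regularized hinge-loss cost can be computed directly:

```python
import numpy as np

def svm_objective(w, X, y, lam):
    """Squared-L2-regularized hinge loss: sum_i max(0, 1 - y_i w.x_i) + lam * ||w||^2."""
    margins = y * (X @ w)                     # y_i * w^T x_i for every sample
    hinge = np.maximum(0.0, 1.0 - margins)    # hinge loss per sample
    return hinge.sum() + lam * np.dot(w, w)   # empirical risk + regularization term

# Toy example: 5 samples, 3 features, labels in {-1, +1}
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
y = np.array([1, -1, 1, 1, -1])
w = np.zeros(3)
print(svm_objective(w, X, y, lam=1.0))  # all margins are 0, so the cost is 5.0
```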

Besides that, the regularization allows for a unique solution in the $p> n$ situation, so the 1st statement is true.
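
To see the 1st statement in practice, here is a quick sketch on synthetic data (generated with `make_classification`; the exact scores will vary, but the point is simply that fitting a linear SVM with $p > n$ is perfectly feasible):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# 50 samples but 500 features: p > n
X, y = make_classification(n_samples=50, n_features=500,
                           n_informative=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Linear SVM (hinge loss + squared-L2 regularization);
# C is roughly inversely proportional to the regularization strength lambda
clf = SVC(kernel="linear", C=1.0).fit(X_tr, y_tr)
print("train accuracy:", clf.score(X_tr, y_tr))
print("test accuracy: ", clf.score(X_te, y_te))
```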

The problem when $p \gg n$ is that, even with regularization, the fit can be biased so heavily towards the training data that the model underperforms badly on unseen data. This doesn't mean SVMs can't be used in that scenario (they are routinely applied to gene expression data, for example, where $p$ can be thousands of times larger than $n$).
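
To illustrate the failure mode the 2nd statement warns about, the same kind of sketch with far more (mostly noisy) features than samples: varying C (smaller C means stronger regularization in sklearn) typically reveals a large gap between training and test accuracy.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# 40 samples, 5000 features, only 5 of them informative: p >> n
X, y = make_classification(n_samples=40, n_features=5000,
                           n_informative=5, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=1)

for C in (100.0, 1.0, 0.01):
    clf = SVC(kernel="linear", C=C).fit(X_tr, y_tr)
    # A big train/test gap here is the overfitting the documentation warns about
    print(f"C={C:>6}: train={clf.score(X_tr, y_tr):.2f}  "
          f"test={clf.score(X_te, y_te):.2f}")
```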

So I see no contradiction between the two statements. The 2nd statement is more likely a warning against overfitting.