Solved – SVM – does a number of dimensions greater than the number of samples give good or bad performance?

scikit-learn, svm

I was reading the sklearn SVM documentation and came across these two statements:

  • Still effective in cases where number of dimensions is greater than the number of samples
  • If the number of features is much greater than the number of samples, the method is likely to give poor performances.

My understanding of SVMs is fairly limited (which is why I am reading the documentation in the first place), but the two statements above seem contradictory. I assume I am missing something. What gap in my understanding makes these two statements not contradictory?

Best Answer

SVMs, like many other linear models, are based on empirical risk minimization, which leads to an optimization problem of the form:

$$\min_w\sum_i\ell(x_i, y_i, w)+\lambda\cdot r(w)$$

where $\ell$ is a loss function (the hinge loss in SVMs) and $r$ is a regularization function.
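
For concreteness (assuming labels $y_i \in \{-1, +1\}$ and omitting the bias term), the hinge loss can be written as

$$\ell(x_i, y_i, w)=\max\left(0,\ 1-y_i\,w^\top x_i\right)$$

which is zero whenever a sample is classified correctly with a margin of at least one.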

The SVM is a squared $\ell_2$-regularized linear model, i.e. $r(w) = \|w\|^2_2$. This guards against huge coefficients, as one would say in regression terms, since the coefficient magnitudes themselves are penalized in the optimization.
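
As a minimal sketch of that objective (not the solver sklearn actually uses; the toy data and the function name `svm_objective` are made up for illustration), the squared-$\ell_2$-regularized hinge-loss cost can be computed directly:

```python
import numpy as np

def svm_objective(w, X, y, lam):
    """Squared-L2-regularized hinge loss: sum_i max(0, 1 - y_i w.x_i) + lam * ||w||^2."""
    margins = y * (X @ w)                     # y_i * w^T x_i for every sample
    hinge = np.maximum(0.0, 1.0 - margins)    # hinge loss per sample
    return hinge.sum() + lam * np.dot(w, w)   # empirical risk + regularization term

# Toy example: 5 samples, 3 features, labels in {-1, +1}
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
y = np.array([1, -1, 1, 1, -1])
w = np.zeros(3)
print(svm_objective(w, X, y, lam=1.0))  # all margins are 0, so the cost is 5.0
```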

Besides that, the regularization allows for a unique solution in the $p> n$ situation, so the 1st statement is true.
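
To see the 1st statement in practice, here is a quick sketch on synthetic data (generated with `make_classification`; the exact scores will vary, but the point is simply that fitting a linear SVM with $p > n$ is perfectly feasible):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# 50 samples but 500 features: p > n
X, y = make_classification(n_samples=50, n_features=500,
                           n_informative=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Linear SVM (hinge loss + squared-L2 regularization);
# C is roughly inversely proportional to the regularization strength lambda
clf = SVC(kernel="linear", C=1.0).fit(X_tr, y_tr)
print("train accuracy:", clf.score(X_tr, y_tr))
print("test accuracy: ", clf.score(X_te, y_te))
```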

The problem when $p \gg n$ is that, even with regularization, the fit can be biased so heavily towards the training data that the model underperforms badly on unseen data. This doesn't mean SVMs can't be used in that scenario (they are routinely applied to gene expression data, for example, where $p$ can be thousands of times larger than $n$).
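
To illustrate the failure mode the 2nd statement warns about, the same kind of sketch with far more (mostly noisy) features than samples: varying C (smaller C means stronger regularization in sklearn) typically reveals a large gap between training and test accuracy.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# 40 samples, 5000 features, only 5 of them informative: p >> n
X, y = make_classification(n_samples=40, n_features=5000,
                           n_informative=5, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=1)

for C in (100.0, 1.0, 0.01):
    clf = SVC(kernel="linear", C=C).fit(X_tr, y_tr)
    # A big train/test gap here is the overfitting the documentation warns about
    print(f"C={C:>6}: train={clf.score(X_tr, y_tr):.2f}  "
          f"test={clf.score(X_te, y_te):.2f}")
```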

So I see no contradiction between the two statements. The 2nd statement is more likely a warning against overfitting.