Solved – Why does an SVM need to keep support vectors

classification, svm

I am reading the book Artificial Intelligence: A Modern Approach and I have trouble understanding why an SVM needs to keep its support vectors.

From the book:

SVMs are a nonparametric method — they retain training examples and potentially need to store them all. On the other hand, in practice they often end up retaining only a small fraction of the number of examples.

And then:

A final important property is that the weights associated with each data point are zero except for the support vectors — the points closest to the separator. Because there are usually many fewer support vectors than examples SVMs gain some of the advantages of parametric models.

Source: Artificial Intelligence: A Modern Approach, p. 746

As the SVM's separator is defined by the hyperplane $w \cdot x + b = 0$, we only need to know $w$ and $b$ to make predictions. Why should it keep all the support vectors?

Best Answer

You are right that the SVM decision function $w \cdot x + b$ depends only on $w$ and $b$; however, it can be shown that $w$ itself is a weighted sum of the support vectors. You can consult Stanford's or MIT's course notes (pages 12 and 5, respectively), but basically it follows from optimization theory that the optimal weight vector $w^*$ can be written in the following form:

$$ w^*=\sum_{i=1}^{n}{\alpha_i^*y_ix_i} $$

where $\{(x_i,y_i)\}_i$ is your data ($x_i$ are the attributes, $y_i$ the labels) and the $\alpha_i$ are the Lagrange multipliers. Further, it can be shown that only a fraction of the $\alpha_i$'s are nonzero; the vectors $(x_i,y_i)$ for which $\alpha_i > 0$ are called support vectors, and that is why you need to store all of them, and only them, to make further predictions (see page 13 of Stanford's course notes and page 7 of MIT's). So if $N_{sv}$ is the number of support vectors, $\{(x_i^{sv},y_i^{sv})\}_i$ the support vectors and $\alpha_i^{sv}$ their corresponding multipliers (which are all strictly positive), we can write the optimal $w^*$ as follows:

$$ w^*=\sum_{i=1}^{N_{sv}}{\alpha_i^{\,sv\,*}y_i^{\,sv}x_i^{\,sv}} $$

The optimal decision function is then:

$$ f_{SVM}^*(x) = \sum_{i=1}^{N_{sv}}{\alpha_i^{\,sv\,*}y_i^{\,sv}(x_i^{\,sv}\cdot x)}+b^* $$
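You can check this numerically. Below is a minimal sketch (not from the book; it uses scikit-learn's `SVC` on made-up toy data) showing that, for a linear kernel, the weight vector reconstructed from the support vectors alone matches the fitted $w$, and that the decision function above reproduces the classifier's scores.

```python
import numpy as np
from sklearn.svm import SVC

# Toy 2-D data: two linearly separable blobs (hypothetical example data).
rng = np.random.RandomState(0)
X = np.vstack([rng.randn(20, 2) - 2, rng.randn(20, 2) + 2])
y = np.hstack([-np.ones(20), np.ones(20)])

clf = SVC(kernel="linear", C=1.0).fit(X, y)

# dual_coef_ holds alpha_i * y_i for the support vectors only,
# so w* = sum_i alpha_i y_i x_i over the support vectors is:
w_from_sv = clf.dual_coef_ @ clf.support_vectors_

print(np.allclose(w_from_sv, clf.coef_))  # True: same w as the fitted separator

# Decision function rebuilt from support vectors: sum_i alpha_i y_i (x_i . x) + b
scores = X @ w_from_sv.ravel() + clf.intercept_
print(np.allclose(scores, clf.decision_function(X)))  # True
print(len(clf.support_vectors_), "support vectors out of", len(X), "examples")
```

With a nonlinear kernel there is no explicit $w$ to precompute, so storing the support vectors (and their $\alpha_i$'s) is the only way to evaluate the decision function at prediction time.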
