Solved – Calculate number of support vectors in SVM

Tags: kernel-trick, machine-learning, self-study, svm

I've written up my current (partial) solution below, and I hope someone can correct me and/or suggest how I should solve the parts that I've left out.

Given an SVM with kernel:

$$K(x,z) = \theta(x)^T \theta(z) =
\left\{
\begin{array}{ll}
1 & \text{if } x = z \\
0 & \text{otherwise}
\end{array}
\right.$$

We are given $N$ training examples $(x_1, y_1) \ldots (x_N, y_N)$ with $y_i = \pm 1$. For simplicity, assume that the $x_i$'s are distinct, and that we only consider the SVM where the hyperplane in the feature space goes through the origin, i.e., the intercept $b=0$.
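
Since the $x_i$'s are assumed distinct, the Gram matrix of this kernel on the training set is just the $N \times N$ identity matrix. Here is a minimal sketch (with hypothetical points and labels) illustrating that:

```python
import numpy as np

def delta_kernel(x, z):
    """K(x, z) = 1 if x == z, and 0 otherwise."""
    return 1.0 if np.array_equal(x, z) else 0.0

# Hypothetical training set: distinct points with +/-1 labels
X = np.array([[0.0, 1.0], [2.0, 3.0], [4.0, 5.0], [6.0, 7.0]])
y = np.array([1, 1, -1, 1])

# Gram matrix K[i, j] = K(x_i, x_j); for distinct points this is the identity
K = np.array([[delta_kernel(xi, xj) for xj in X] for xi in X])
print(K)  # 4 x 4 identity matrix
```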

  1. Recall the weight vector $w$ used in SVM has the form
    $$w = \sum_{i=1}^N \alpha_i y_i \theta(x_i)$$
    Compute the $\alpha_i$'s explicitly that would be found using SVMs with this kernel.

  2. Recall that the SVM algorithm outputs a classifier that, on input $x$, computes the sign of $w^T \theta(x)$. What is the value of this inner product on the training example $x_i$? What is the value of this inner product on any example $x$ not seen during training? Based on these answers, what kind of generalization error do you expect will be achieved by SVMs using this kernel?

  3. Recall that the generalization error of SVMs can be bounded using the margin $\delta$ (which is equal to $1/\|w\|$), or using the number of support vectors. What is $\delta$ in this case? How many support vectors are there in this case? How are these answers consistent with your answer in part (2)?

For part (1), I maximized the dual form of the objective function
$$
\max_{\alpha} \sum_{n=1}^N \alpha_n - \frac{1}{2} \sum_{n=1}^N \sum_{m=1}^N y_n y_m \alpha_n \alpha_m K(x_n, x_m)
$$
subject to the constraints $\alpha_n \geq 0, n=1,\ldots,N$ and $\sum_{n=1}^N \alpha_n y_n = 0$. I did that by forming the Lagrangian
$$
\mathcal{L}(\alpha, \beta, \lambda) = \sum_{n=1}^N \alpha_n - \frac{1}{2} \sum_{n=1}^N \sum_{m=1}^N y_n y_m \alpha_n \alpha_m K(x_n, x_m) - \sum_{n=1}^N \beta_n \alpha_n - \lambda \sum_{n=1}^N \alpha_n y_n
$$
then taking the derivative with respect to $\alpha_i$, and setting it to zero.
The solution I got for $\alpha_i$ is
$$
\alpha_i = \frac{1}{y_i} \left[ \frac{1}{y_i} - \frac{1}{N} \sum_{n=1}^N \frac{1}{y_n}\right]
$$
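
Spelling out that step (using $K(x_n, x_m) = \delta_{nm}$ and $y_i^2 = 1$, and assuming $\alpha_i > 0$ at the optimum so that complementary slackness gives $\beta_i = 0$), the stationarity condition reads
$$
\frac{\partial \mathcal{L}}{\partial \alpha_i} = 1 - \alpha_i - \beta_i - \lambda y_i = 0
\quad\Longrightarrow\quad
\alpha_i = 1 - \lambda y_i,
$$
and substituting into $\sum_{n} \alpha_n y_n = 0$ gives $\lambda = \frac{1}{N}\sum_{n} y_n$, so $\alpha_i = 1 - \frac{y_i}{N}\sum_{n} y_n$, which is the same expression as above once you use $1/y_n = y_n$.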

For part (2), I started by substituting the $\alpha_i$ found in part (1) into the inner product
$$
w^T \theta(x) = \sum_{j=1}^N \alpha_j y_j \theta(x_j)^T \theta(x)
$$
When $x_i$ is in the training set, I get
$$
w^T \theta(x_i) = \frac{1}{y_i} - \frac{1}{N} \sum_{n=1}^N \frac{1}{y_n}
$$

When $x$ is not in the training set, I get
$$
w^T \theta(x) = 0
$$
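
As a quick numerical sanity check of these two values, here is a minimal sketch with hypothetical $\pm 1$ labels (the $x_i$'s only enter through the Gram matrix, which is the identity for distinct points):

```python
import numpy as np

# Hypothetical +/-1 labels; the x_i's only enter through K(x_i, x_j) = delta_ij,
# so we can work with the Gram matrix (the identity) directly.
y = np.array([1, 1, -1, 1, -1])
N = len(y)

# Closed-form dual variables from part (1): alpha_i = (1/y_i) * (1/y_i - mean(1/y))
alpha = (1.0 / y) * (1.0 / y - np.mean(1.0 / y))

# On a training example x_i: w^T theta(x_i) = sum_j alpha_j y_j K(x_j, x_i) = alpha_i y_i
train_scores = alpha * y
print(train_scores)           # 1/y_i - (1/N) sum_n 1/y_n, matching the expression above
print(np.sign(train_scores))  # recovers the training labels exactly

# On an unseen x: K(x_j, x) = 0 for every j, so the score is identically 0
unseen_score = np.sum(alpha * y * 0.0)
print(unseen_score)           # 0.0 -- the classifier carries no information off the training set
```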

Is it true that this shows that the kernel will yield poor generalization performance? Is there any way I can show it formally?

Finally, for part (3), I computed $\delta$ as
$$
\delta = \frac{1}{\|w\|} = \frac{1}{\sqrt{\sum_{n=1}^N \alpha_n^2 y_n^2}}
$$

How do I calculate the number of support vectors? And how does this relate to part (2)?

Best Answer

Notice that since $y_{i} = \pm 1$, we have $1/y_{i} = y_{i}$, so you can rewrite $$ \alpha_i = \frac{1}{y_i} \left[ \frac{1}{y_i} - \frac{1}{N} \sum_{n=1}^N \frac{1}{y_n}\right] = y_i \left[y_i - \frac{1}{N} \sum_{n=1}^N y_n\right] = 1 - y_{i}\frac{N^{+}-N^{-}}{N}, $$ where $N^{+}$ and $N^{-}$ are the numbers of positive and negative training examples. You can check that $\sum_{n}\alpha_{n}y_{n} = 0$. Moreover, as long as both classes are present we have $|N^{+}-N^{-}| < N$, so $\alpha_{n} > 0$ for every $n$; that is, all $N$ training points are support vectors.

As for the margin, $$ \|w\|^{2} = \sum_{n}\alpha_{n}^{2} = N\left[1-\left(\frac{N^{+}-N^{-}}{N}\right)^{2}\right], $$ so $\delta = 1/\|w\| = 1/\sqrt{N\left[1-\left(\frac{N^{+}-N^{-}}{N}\right)^{2}\right]}$, which shrinks like $1/\sqrt{N}$. A vanishing margin and the fact that every training point is a support vector are both consistent with part (2): the classifier simply memorizes the training set and outputs $0$ on anything it has not seen, so we expect very poor generalization.
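
For what it's worth, a small numerical check of these closed forms (hypothetical labels; the Gram matrix is the identity since the $x_i$'s are distinct):

```python
import numpy as np

y = np.array([1, 1, -1, 1, -1, -1, 1])  # hypothetical +/-1 labels
N, Np, Nm = len(y), np.sum(y == 1), np.sum(y == -1)

# alpha_i = 1 - y_i * (N+ - N-) / N
alpha = 1.0 - y * (Np - Nm) / N

print(np.isclose(np.sum(alpha * y), 0.0))    # the equality constraint holds
print(np.all(alpha > 0), np.sum(alpha > 0))  # all N points are support vectors

# ||w||^2 = sum_{n,m} alpha_n alpha_m y_n y_m K(x_n, x_m) = sum_n alpha_n^2
K = np.eye(N)
w_sq = (alpha * y) @ K @ (alpha * y)
print(np.isclose(w_sq, np.sum(alpha ** 2)))              # two ways of computing ||w||^2 agree
print(np.isclose(w_sq, N * (1 - ((Np - Nm) / N) ** 2)))  # matches the closed form above
print(1.0 / np.sqrt(w_sq))                               # the margin delta = 1 / ||w||
```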