Solved – Calculate number of support vectors in SVM

Tags: kernel-trick, machine-learning, self-study, svm

I've written up my current (partial) solution below, and I hope someone can correct me and/or suggest how I should solve the parts that I've left out.

Given an SVM with kernel:

$$K(x,z) = \theta(x)^T \theta(z) =
\left\{
\begin{array}{ll}
1 & \text{if } x = z \\
0 & \text{otherwise}
\end{array}
\right.$$

We are given $N$ training examples $(x_1, y_1) \ldots (x_N, y_N)$ with $y_i = \pm 1$. For simplicity, assume that the $x_i$'s are distinct, and that we only consider the SVM where the hyperplane in the feature space goes through the origin, i.e., the intercept $b=0$.
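
Since the $x_i$'s are assumed distinct, the Gram matrix of this kernel on the training set is just the $N \times N$ identity matrix. Here is a minimal sketch (with hypothetical points and labels) illustrating that:

```python
import numpy as np

def delta_kernel(x, z):
    """K(x, z) = 1 if x == z, and 0 otherwise."""
    return 1.0 if np.array_equal(x, z) else 0.0

# Hypothetical training set: distinct points with +/-1 labels
X = np.array([[0.0, 1.0], [2.0, 3.0], [4.0, 5.0], [6.0, 7.0]])
y = np.array([1, 1, -1, 1])

# Gram matrix K[i, j] = K(x_i, x_j); for distinct points this is the identity
K = np.array([[delta_kernel(xi, xj) for xj in X] for xi in X])
print(K)  # 4 x 4 identity matrix
```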

  1. Recall the weight vector $w$ used in SVM has the form
    $$w = \sum_{i=1}^N \alpha_i y_i \theta(x_i)$$
    Compute the $\alpha_i$'s explicitly that would be found using SVMs with this kernel.

  2. Recall that the SVM algorithm outputs a classifier that, on input $x$, computes the sign of $w^T \theta(x)$. What is the value of this inner product on the training example $x_i$? What is the value of this inner product on any example $x$ not seen during training? Based on these answers, what kind of generalization error do you expect will be achieved by SVMs using this kernel?

  3. Recall that the generalization error of SVMs can be bounded using the margin $\delta$ (which is equal to $1/\|w\|$), or using the number of support vectors. What is $\delta$ in this case? How many support vectors are there in this case? How are these answers consistent with your answer in part (2)?

For part (1), I maximized the dual form of the objective function
$$
\max_{\alpha} \sum_{n=1}^N \alpha_n - \frac{1}{2} \sum_{n=1}^N \sum_{m=1}^N y_n y_m \alpha_n \alpha_m K(x_n, x_m)
$$
subject to the constraints $\alpha_n \geq 0, n=1,\ldots,N$ and $\sum_{n=1}^N \alpha_n y_n = 0$. I did that by forming the Lagrangian
$$
\mathcal{L}(\alpha, \beta, \lambda) = \sum_{n=1}^N \alpha_n - \frac{1}{2} \sum_{n=1}^N \sum_{m=1}^N y_n y_m \alpha_n \alpha_m K(x_n, x_m) - \sum_{n=1}^N \beta_n \alpha_n - \lambda \sum_{n=1}^N \alpha_n y_n
$$
then taking the derivative with respect to $\alpha_i$, and setting it to zero.
The solution I got for $\alpha_i$ is
$$
\alpha_i = \frac{1}{y_i} \left[ \frac{1}{y_i} - \frac{1}{N} \sum_{n=1}^N \frac{1}{y_n}\right]
$$
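
Spelling out that step (using $K(x_n, x_m) = \delta_{nm}$ and $y_i^2 = 1$, and assuming $\alpha_i > 0$ at the optimum so that complementary slackness gives $\beta_i = 0$), the stationarity condition reads
$$
\frac{\partial \mathcal{L}}{\partial \alpha_i} = 1 - \alpha_i - \beta_i - \lambda y_i = 0
\quad\Longrightarrow\quad
\alpha_i = 1 - \lambda y_i,
$$
and substituting into $\sum_{n} \alpha_n y_n = 0$ gives $\lambda = \frac{1}{N}\sum_{n} y_n$, so $\alpha_i = 1 - \frac{y_i}{N}\sum_{n} y_n$, which is the same expression as above once you use $1/y_n = y_n$.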

For part (2), I started by substituting the $\alpha_i$ found in part (1) into the inner product
$$
w^T \theta(x) = \sum_{j=1}^N \alpha_j y_j \theta(x_j)^T \theta(x)
$$
When $x_i$ is in the training set, I get
$$
w^T \theta(x_i) = \frac{1}{y_i} - \frac{1}{N} \sum_{n=1}^N \frac{1}{y_n}
$$

When $x$ is not in the training set, I get
$$
w^T \theta(x) = 0
$$
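
As a quick numerical sanity check of these two values, here is a minimal sketch with hypothetical $\pm 1$ labels (the $x_i$'s only enter through the Gram matrix, which is the identity for distinct points):

```python
import numpy as np

# Hypothetical +/-1 labels; the x_i's only enter through K(x_i, x_j) = delta_ij,
# so we can work with the Gram matrix (the identity) directly.
y = np.array([1, 1, -1, 1, -1])
N = len(y)

# Closed-form dual variables from part (1): alpha_i = (1/y_i) * (1/y_i - mean(1/y))
alpha = (1.0 / y) * (1.0 / y - np.mean(1.0 / y))

# On a training example x_i: w^T theta(x_i) = sum_j alpha_j y_j K(x_j, x_i) = alpha_i y_i
train_scores = alpha * y
print(train_scores)           # 1/y_i - (1/N) sum_n 1/y_n, matching the expression above
print(np.sign(train_scores))  # recovers the training labels exactly

# On an unseen x: K(x_j, x) = 0 for every j, so the score is identically 0
unseen_score = np.sum(alpha * y * 0.0)
print(unseen_score)           # 0.0 -- the classifier carries no information off the training set
```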

Is it true that this shows that the kernel will yield poor generalization performance? Is there any way I can show it formally?

Finally, for part (3), I computed $\delta$ as
$$
\delta = \frac{1}{\|w\|} = \frac{1}{\sqrt{\sum_{n=1}^N \alpha_n^2 y_n^2}}
$$

How do I calculate the number of support vectors? And how does this relate to part (2)?

Best Answer

Notice that since $y_{i} = \pm 1$, we have $1/y_{i} = y_{i}$, so you can rewrite $$ \alpha_i = \frac{1}{y_i} \left[ \frac{1}{y_i} - \frac{1}{N} \sum_{n=1}^N \frac{1}{y_n}\right] = y_i \left[y_i - \frac{1}{N} \sum_{n=1}^N y_n\right] = 1 - y_{i}\frac{N^{+}-N^{-}}{N}, $$ where $N^{+}$ and $N^{-}$ are the numbers of positive and negative training examples. You can check that $\sum_{n}\alpha_{n}y_{n} = 0$. Moreover, as long as both classes are present we have $|N^{+}-N^{-}| < N$, so $\alpha_{n} > 0$ for every $n$; that is, all $N$ training points are support vectors.

As for the margin, $$ \|w\|^{2} = \sum_{n}\alpha_{n}^{2} = N\left[1-\left(\frac{N^{+}-N^{-}}{N}\right)^{2}\right], $$ so $\delta = 1/\|w\| = 1/\sqrt{N\left[1-\left(\frac{N^{+}-N^{-}}{N}\right)^{2}\right]}$, which shrinks like $1/\sqrt{N}$. A vanishing margin and the fact that every training point is a support vector are both consistent with part (2): the classifier simply memorizes the training set and outputs $0$ on anything it has not seen, so we expect very poor generalization.
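
For what it's worth, a small numerical check of these closed forms (hypothetical labels; the Gram matrix is the identity since the $x_i$'s are distinct):

```python
import numpy as np

y = np.array([1, 1, -1, 1, -1, -1, 1])  # hypothetical +/-1 labels
N, Np, Nm = len(y), np.sum(y == 1), np.sum(y == -1)

# alpha_i = 1 - y_i * (N+ - N-) / N
alpha = 1.0 - y * (Np - Nm) / N

print(np.isclose(np.sum(alpha * y), 0.0))    # the equality constraint holds
print(np.all(alpha > 0), np.sum(alpha > 0))  # all N points are support vectors

# ||w||^2 = sum_{n,m} alpha_n alpha_m y_n y_m K(x_n, x_m) = sum_n alpha_n^2
K = np.eye(N)
w_sq = (alpha * y) @ K @ (alpha * y)
print(np.isclose(w_sq, np.sum(alpha ** 2)))              # two ways of computing ||w||^2 agree
print(np.isclose(w_sq, N * (1 - ((Np - Nm) / N) ** 2)))  # matches the closed form above
print(1.0 / np.sqrt(w_sq))                               # the margin delta = 1 / ||w||
```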