Machine Learning – Do Support Vectors Share the Same Alpha Value in SVM?

machine-learning, svm

Consider the dual with no offset and no slack.

In the dual we have that for data points that are support vectors:

$$\alpha_t > 0$$

and that the constraint is satisfied with equality (since it is a support vector):

$$y^{(t)}\sum^n_{t'=1} \alpha_{t'}y^{(t')}x^{(t')} \cdot x^{(t)} = 1$$
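(This equality is just the KKT complementary-slackness condition of the dual written out, under the same no-offset, no-slack assumptions:

$$\alpha_t\Big[y^{(t)}\sum^n_{t'=1}\alpha_{t'}y^{(t')}x^{(t')}\cdot x^{(t)} - 1\Big] = 0,$$

so whenever $\alpha_t > 0$ the bracketed term must vanish.)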

However, just because $\alpha_t > 0$, it's not obvious to me whether all the support vectors have the same value of $\alpha_t$. Do all of the support vectors have the same value of $\alpha_t$? I was looking for a mathematical justification or at least some intuition.
My guess is that they would have the same $\alpha_t$ value in the absence of slack.
My other guess was that $\alpha_i$ intuitively indicates the importance of that point (maybe that's wrong). If that's true, then not all support vectors have the same value of $\alpha_i$, because not all support vectors are essential support vectors. What I mean is that some support vectors lie on the margin boundary, but removing them will not necessarily change the margin (for example, with 3 SVs in a line, you can remove the middle one and keep the same boundary). So do essential support vectors get larger values of $\alpha_t$?

Also, I was wondering what happens to the value of $\alpha_t$ in the case of slack and an offset. I was guessing that points closer to the decision boundary get larger values of $\alpha$? However, are points inside the margin even considered support vectors?
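One way to check this empirically (a minimal sketch, assuming scikit-learn's `SVC`; the toy dataset is made up, and a very large `C` approximates the hard-margin case since `SVC` always uses a soft margin and an offset term):

```python
import numpy as np
from sklearn.svm import SVC

# Made-up, linearly separable 2-D toy data.
X = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 3.0],
              [2.0, 0.5], [3.0, 1.0], [4.0, 1.5]])
y = np.array([1, 1, 1, -1, -1, -1])

# A very large C approximates the hard-margin SVM.
clf = SVC(kernel="linear", C=1e6).fit(X, y)

# dual_coef_ stores y_i * alpha_i for the support vectors,
# so the absolute values recover the alpha_i themselves.
alphas = np.abs(clf.dual_coef_).ravel()
print("support vectors:\n", clf.support_vectors_)
print("alpha values:   ", alphas)  # in general not all equal
```

On runs like this the printed $\alpha$ values typically differ across the support vectors, even without slack.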

Best Answer

The cost function of SVM is: $$\min_{\alpha,\xi,b} \frac{1}{2}\|\mathbf{w}\|^2 + C\sum_{i=1}^n \xi_i$$

where the $\xi_i$ are the slack variables (zero in the hard-margin case), $\mathcal{S}$ is the index set of the support vectors, and $$ \|\mathbf{w}\|^2 =\mathbf{w}^T\mathbf{w} = \sum_{i\in \mathcal{S}}\sum_{j\in \mathcal{S}} \alpha_i \alpha_j y_i y_j \kappa(\mathbf{x}_i,\mathbf{x}_j) $$
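As a quick numerical check of this identity (a sketch, assuming a linear kernel and scikit-learn's `SVC`; the data are made up, `dual_coef_` stores $y_i\alpha_i$ over the support set $\mathcal{S}$, and `coef_` gives the primal $\mathbf{w}$):

```python
import numpy as np
from sklearn.svm import SVC

# Made-up 2-D toy data, just to check the identity numerically.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (20, 2)) + 2,
               rng.normal(0, 1, (20, 2)) - 2])
y = np.array([1] * 20 + [-1] * 20)

clf = SVC(kernel="linear", C=1.0).fit(X, y)

coef = clf.dual_coef_.ravel()        # y_i * alpha_i, i in S
S = clf.support_vectors_             # x_i, i in S
K = S @ S.T                          # kappa(x_i, x_j) for the linear kernel

dual_norm_sq = coef @ K @ coef                     # sum_ij alpha_i alpha_j y_i y_j kappa(x_i, x_j)
primal_norm_sq = (clf.coef_ @ clf.coef_.T).item()  # ||w||^2 from the primal weights
print(dual_norm_sq, primal_norm_sq)                # agree up to solver tolerance
```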

A couple of important properties can be derived from the Lagrangian (here with slack variables):

$$L_p = \frac{1}{2}\|\mathbf{w}\|^2+C\sum_{i=1}^n\xi_i -\sum_{i=1}^n\alpha_i\Big[y_i\big(\langle\mathbf{w},\varphi(\mathbf{x}_i)\rangle+b\big)-(1-\xi_i)\Big]-\sum_{i=1}^n\mu_i\xi_i. $$

By setting the partial derivative with respect to $\xi$ (the slack variables) to zero, we obtain the following relation for $\alpha$: $$\frac{\partial L_p}{\partial \xi_i}=0 \quad \rightarrow \quad \alpha_i=C-\mu_i, \quad \forall i$$ Since $\mu_i \ge 0$, in the presence of slack variables all support values ($\alpha$'s) are bounded between $0$ and $C$. They are not all equal.
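In other words, with slack every support value satisfies $0 < \alpha_i \le C$: support vectors lying exactly on the margin have $0 < \alpha_i < C$, while margin violators are pushed to the bound $\alpha_i = C$ (so, to answer the last part of the question, points inside the margin are indeed support vectors). A minimal sketch of this, assuming scikit-learn's `SVC` on made-up overlapping data:

```python
import numpy as np
from sklearn.svm import SVC

# Made-up overlapping classes so that some points end up inside the margin.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1.2, (30, 2)) + 1,
               rng.normal(0, 1.2, (30, 2)) - 1])
y = np.array([1] * 30 + [-1] * 30)

C = 1.0
clf = SVC(kernel="linear", C=C).fit(X, y)

alphas = np.abs(clf.dual_coef_).ravel()  # alpha_i for the support vectors
print("all alphas in (0, C]:", bool(np.all((alphas > 0) & (alphas <= C + 1e-8))))

# alpha_i = C allows xi_i > 0 (margin violators); 0 < alpha_i < C forces xi_i = 0,
# i.e. the point sits exactly on the margin boundary.
at_bound = np.isclose(alphas, C)
print("SVs with alpha = C (inside or beyond the margin):", int(at_bound.sum()))
print("SVs with 0 < alpha < C (exactly on the margin):  ", int((~at_bound).sum()))
```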

In the absence of slack variables (hard-margin SVM), the $\alpha$ values are in general not equal either, because $\alpha$ enters the objective directly, as shown in the first two equations: the cost is a weighted quadratic function of $\alpha$, where the kernel evaluations between the corresponding points determine the weights. Since not all of those weights are equal, the optimal $\alpha$ values aren't either.
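For reference, the hard-margin dual referred to here is the quadratic program (written with an offset $b$, which is what contributes the equality constraint; drop that constraint for the no-offset case):

$$\max_{\alpha_i \ge 0}\ \sum_{i=1}^n \alpha_i - \frac{1}{2}\sum_{i=1}^n\sum_{j=1}^n \alpha_i\alpha_j y_i y_j\,\kappa(\mathbf{x}_i,\mathbf{x}_j) \qquad \text{s.t.}\quad \sum_{i=1}^n \alpha_i y_i = 0.$$

The weights of the quadratic term are the kernel evaluations $\kappa(\mathbf{x}_i,\mathbf{x}_j)$, which differ from pair to pair, so the optimizer has no reason to assign every support vector the same $\alpha_i$.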

This is more apparent in least-squares SVM (LS-SVM), where all training points become support vectors due to the problem formulation. One of the original methods proposed to obtain sparsity in LS-SVM involved pruning the support vectors with small $\alpha$ values.