Consider the dual with no offset and no slack.
In the dual we have that for data points that are support vectors:
$$\alpha_t > 0$$
and that the constraint is satisfied with equality (since it is a support vector!):
$$y^{(t)}\sum^n_{t'=1} \alpha_{t'}y^{(t')}x^{(t')} \cdot x^{(t)} = 1$$
However, just because $\alpha_t > 0$, it's not obvious to me whether all of the support vectors have the same value of $\alpha_t$. Do all of the support vectors have the same value of $\alpha_t$? I was looking for a mathematical justification or intuition for my question.
My guess is that they would have the same $\alpha_t$ value in the absence of slack.
My other guess was that $\alpha_t$ intuitively measures the importance of that point (maybe that's wrong). If that's true, then not all support vectors have the same value of $\alpha_t$, because not all support vectors are essential support vectors. What I mean is that some support vectors lie on the margin boundary, but removing them will not necessarily change the margin (for example, consider the case where you have 3 SVs in a line: you can remove the middle one and keep the same boundary). So do essential support vectors get larger values of $\alpha_t$?
Also, I was wondering what happens to the values of $\alpha_t$ in the case of slack and offset. I was guessing that points closer to the decision boundary get larger values of $\alpha_t$? However, are points inside the margin even considered support vectors?
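To make the 3-collinear-SVs scenario concrete, here is a small sketch of my own (assuming scikit-learn is available; `SVC` fits an offset, unlike the no-offset setup above, and its `dual_coef_` stores $y_i\alpha_i$). The dual solution is degenerate for collinear support vectors, so the solver's split of $\alpha$ among them is not unique, but the boundary is:

```python
# Sketch (assumes scikit-learn): three positive points on the margin line,
# one negative point. Removing the middle collinear point leaves the
# separating hyperplane unchanged.
import numpy as np
from sklearn.svm import SVC

X_full = np.array([[-1.0, 0.0], [1.0, -1.0], [1.0, 0.0], [1.0, 1.0]])
y_full = np.array([-1, 1, 1, 1])

# Very large C approximates the hard-margin (no-slack) SVM
clf_full = SVC(kernel="linear", C=1e6).fit(X_full, y_full)

# Drop the middle collinear support vector and refit
clf_drop = SVC(kernel="linear", C=1e6).fit(X_full[[0, 1, 3]], y_full[[0, 1, 3]])

print(clf_full.coef_, clf_drop.coef_)  # same hyperplane, w close to (1, 0)
print(np.abs(clf_full.dual_coef_))     # the alphas, as |y_i * alpha_i|
```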
Best Answer
The cost function of SVM is: $$\min_{\alpha,\xi,b} \frac{1}{2}\|\mathbf{w}\|^2 + C\sum_{i=1}^n \xi_i$$
where $\xi_i$ are the slack variables ($\xi_i = 0$ for the hard margin case) and $$ \|\mathbf{w}\|^2 = \mathbf{w}^T\mathbf{w} = \sum_{i\in \mathcal{S}}\sum_{j\in \mathcal{S}} \alpha_i \alpha_j y_i y_j \kappa(\mathbf{x}_i,\mathbf{x}_j) $$ with $\mathcal{S}$ the set of support vector indices.
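As a sanity check (my own sketch, assuming scikit-learn is available; data and variable names are mine), the double-sum expansion of $\|\mathbf{w}\|^2$ can be verified numerically against the primal weight vector for a linear kernel:

```python
# Sketch (assumes scikit-learn): verify
#   ||w||^2 = sum_i sum_j alpha_i alpha_j y_i y_j k(x_i, x_j)
# for a linear kernel. dual_coef_ stores y_i * alpha_i (support vectors only).
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-1.5, 0.4, (15, 2)), rng.normal(1.5, 0.4, (15, 2))])
y = np.array([-1] * 15 + [1] * 15)

clf = SVC(kernel="linear", C=1.0).fit(X, y)

sv = X[clf.support_]                        # the support vectors
coef = clf.dual_coef_.ravel()               # y_i * alpha_i
K = sv @ sv.T                               # linear-kernel Gram matrix
norm_sq_dual = coef @ K @ coef              # double sum over i, j in S
norm_sq_primal = (clf.coef_ @ clf.coef_.T).item()
print(norm_sq_dual, norm_sq_primal)         # agree to numerical precision
```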
A couple of important properties can be derived from the Lagrangian (here with slack variables):
$$L_p = \frac{1}{2}\|\mathbf{w}\|^2+C\sum_{i=1}^n\xi_i -\sum_{i=1}^n\alpha_i\Big[y_i\big(\langle\mathbf{w},\varphi(\mathbf{x}_i)\rangle+b\big)-(1-\xi_i)\Big]-\sum_{i=1}^n\mu_i\xi_i. $$
By setting the partial derivative with respect to $\xi_i$ (the slack variables) to zero, we obtain the following relation for $\alpha$: $$\frac{\partial L_p}{\partial \xi_i}=0 \quad \rightarrow \quad \alpha_i=C-\mu_i, \quad \forall i$$ Since the multipliers $\mu_i$ are nonnegative, in the presence of slack variables all support values ($\alpha$'s) are bounded between $0$ and $C$. They are not all equal.
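This box constraint is easy to see empirically. A sketch of my own (assuming scikit-learn; the overlapping toy data are mine): with a soft margin, every support value lands in $(0, C]$, and margin violators sit exactly at the bound $C$:

```python
# Sketch (assumes scikit-learn): with overlapping classes and a soft margin,
# all alphas satisfy 0 < alpha_i <= C, and margin violators hit alpha_i = C.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(-0.5, 1.0, (30, 2)), rng.normal(0.5, 1.0, (30, 2))])
y = np.array([-1] * 30 + [1] * 30)

C = 0.5
clf = SVC(kernel="linear", C=C).fit(X, y)
alphas = np.abs(clf.dual_coef_).ravel()     # alpha_i for the support vectors

print(alphas.min(), alphas.max())           # all in (0, C]
print(np.isclose(alphas, C).sum())          # how many sit at the bound C
```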
In the absence of slack variables (hard-margin SVM), the $\alpha$ values are not equal either, because $\alpha$ enters the objective function directly, as shown in the first two equations. In the hard-margin SVM, the cost function is a weighted quadratic form in $\alpha$, where the kernel evaluations between the corresponding points determine the weights. Since not all weights are equal, the $\alpha$ values aren't either.
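A minimal concrete instance (my own sketch, assuming scikit-learn): one negative point at the origin and two positive points at $(2,0)$ and $(0,2)$. All three are support vectors, and working through the KKT conditions by hand gives $\mathbf{w}=(1,1)$, $b=-1$, and $\alpha = (1, \tfrac{1}{2}, \tfrac{1}{2})$, so the support values are provably unequal:

```python
# Sketch (assumes scikit-learn): a hard-margin-like fit where the three
# support vectors get unequal alphas (1.0 for the negative point at the
# origin, 0.5 for each positive point; w = (1, 1), b = -1).
import numpy as np
from sklearn.svm import SVC

X = np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 2.0]])
y = np.array([-1, 1, 1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)  # large C approximates hard margin
alphas = np.abs(clf.dual_coef_).ravel()
print(np.sort(alphas))  # approximately [0.5, 0.5, 1.0]
```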
This is more apparent in least-squares SVM (LS-SVM), where all training points become support vectors due to the problem formulation. One of the original methods proposed to obtain sparsity in LS-SVM involved pruning support vectors with small $|\alpha|$ values.
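For illustration, here is a numpy-only sketch of my own of the standard LS-SVM dual, which is just a linear system (variable names and toy data are mine): because nothing forces any $\alpha_i$ to zero, essentially every training point ends up a support vector:

```python
# Sketch of the LS-SVM dual (Suykens-style formulation): solve
#   [ 0    y^T          ] [b]     [0]
#   [ y    Omega + I/g  ] [alpha] [1]
# with Omega_ij = y_i y_j k(x_i, x_j). Generically all alpha_i are nonzero.
import numpy as np

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(-1, 0.6, (10, 2)), rng.normal(1, 0.6, (10, 2))])
y = np.array([-1.0] * 10 + [1.0] * 10)
n, gamma = len(y), 10.0

K = X @ X.T                                   # linear kernel
Omega = (y[:, None] * y[None, :]) * K
A = np.zeros((n + 1, n + 1))
A[0, 1:] = y
A[1:, 0] = y
A[1:, 1:] = Omega + np.eye(n) / gamma
rhs = np.concatenate([[0.0], np.ones(n)])

sol = np.linalg.solve(A, rhs)
b, alpha = sol[0], sol[1:]
print(np.count_nonzero(np.abs(alpha) > 1e-12), "of", n, "alphas are nonzero")
```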