Solved – Deriving the intercept term in a linearly separable and soft-margin SVM

Tags: intercept, machine learning, svm

I have read Andrew Ng's lecture notes on Support Vector Machines as well as the notes from MIT OpenCourseWare, and I have a few doubts concerning the derivation of the intercept term:

  • First, there is a discrepancy between the two sets of notes: Ng states that $$b=-\displaystyle\frac{\max_{i:y_i=-1}w^{*T}x_i+\min_{i:y_i=1}w^{*T}x_i}{2},$$ whereas the MIT notes state that $b=1-\min_{i:y_i=1}w^{*T}x_i$.
  • Second, I think I understand the principle of this approach, given that in the linearly separable case all support vectors lie exactly on the margin (i.e. at the same distance from the decision boundary); however, how does one determine the intercept $b$ when we introduce the soft-margin slack variables $\xi=(\xi_1,\dots,\xi_n)$? In Burges' tutorial on SVMs (A Tutorial on Support Vector Machines for Pattern Recognition), at the bottom of page 10, he says that:

However $b$ is easily found by using the KKT “complementarity” condition, Eq. (21), by choosing any $i$ for which $\alpha_i \neq 0$ and computing $b$ (note that it is numerically safer to take the mean value of $b$ resulting from all such equations).

I understand Burges' approach; however, the paper dates back to 1998, and this might not be the standard approach anymore.

Can anyone explain this discrepancy between the two sets of notes? Does anyone know how to determine $b$ in the presence of soft margins?

Best Answer

The two formulas give the same result.

The SVM formulation assumes:

$w^Tx+b=0$ (the decision boundary)
$w^Tx_{sp}+b=1$, where $x_{sp}$ is a support vector with $y=1$ (eq. 1)
$w^Tx_{sn}+b=-1$, where $x_{sn}$ is a support vector with $y=-1$ (eq. 2)

By eq. 1, $b=1-w^Tx_{sp}$, which is the MIT formula.
By eq. 2, $b=-1-w^Tx_{sn}$; adding the two equations side by side yields $2b=-w^Tx_{sp}-w^Tx_{sn}$, hence $b=-\frac{w^Tx_{sp}+w^Tx_{sn}}{2}$, which is Andrew Ng's formula.
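
As a quick sanity check, here is a minimal numpy sketch (with made-up values for $w$, $x_{sp}$ and $x_{sn}$, not taken from either set of notes) showing that the two formulas return the same intercept when both support vectors lie exactly on their margins:

```python
import numpy as np

# Toy hard-margin example with made-up numbers (not from either set of notes).
# x_sp sits on the +1 margin and x_sn on the -1 margin for the chosen w.
w = np.array([2.0, 0.0])
x_sp = np.array([2.0, 1.0])    # w @ x_sp = 4, so 4 + b = 1  ->  b = -3
x_sn = np.array([1.0, -1.0])   # w @ x_sn = 2, so 2 + b = -1 ->  b = -3

b_mit = 1 - w @ x_sp                  # MIT formula, from eq. 1
b_ng = -(w @ x_sp + w @ x_sn) / 2     # Ng formula, average of eq. 1 and eq. 2

print(b_mit, b_ng)                    # -3.0 -3.0
```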

Andrew Ng's formula is more numerically stable because it averages over two points (though in exact arithmetic there is no difference); an even more stable way is to average over all the support vectors, as you mentioned.

In the soft-margin case, not all the support vectors lie on the margin (i.e. not all of them satisfy eq. 1 or eq. 2). We need to pick out only the support vectors that do lie on the margin (those with $0<\alpha_i<C$ in both the MIT and Ng's slides) and compute the intercept from them in the same way, e.g. by averaging $b=y_i-w^Tx_i$ over those vectors.
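
For a concrete soft-margin illustration, here is a short sketch using scikit-learn on a synthetic dataset (both are my own assumptions, not part of either set of notes) that recovers $b$ from the on-margin support vectors and compares it with the intercept the solver itself reports:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Overlapping classes so that slack variables are actually used.
X, y = make_blobs(n_samples=200, centers=2, cluster_std=2.5, random_state=0)
y = 2 * y - 1                        # relabel classes to {-1, +1}

C = 1.0
clf = SVC(kernel="linear", C=C).fit(X, y)

w = clf.coef_.ravel()                # weight vector (available for the linear kernel)
alpha_y = clf.dual_coef_.ravel()     # y_i * alpha_i for each support vector
sv = clf.support_vectors_

# On-margin support vectors have 0 < alpha_i < C; bounded ones sit at alpha_i = C.
# The 1e-8 tolerance is an arbitrary choice for this sketch.
on_margin = np.abs(alpha_y) < C - 1e-8
y_sv = np.sign(alpha_y)              # sign of y_i * alpha_i recovers the label y_i

# For an on-margin vector, y_i (w . x_i + b) = 1, so b = y_i - w . x_i (as y_i = +/-1).
# Averaging over all such vectors is the "numerically safer" choice Burges mentions.
b = np.mean(y_sv[on_margin] - sv[on_margin] @ w)

print(b, clf.intercept_[0])          # the two values should agree closely
```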
