Solved – Why is the bias term in SVM estimated separately, instead of an extra dimension in the feature vector

Tags: svm, threshold

The optimal hyperplane in SVM is defined as:

$$\mathbf w \cdot \mathbf x+b=0,$$

where $b$ represents the threshold. If we have some mapping $\phi$ which maps the input space into some space $Z$, we can define the SVM in the space $Z$, where the optimal hyperplane will be:

$$\mathbf w \cdot \phi(\mathbf x)+b=0.$$

However, we can always define the mapping $\phi$ so that $\phi_0(\mathbf x)=1$, $\forall \mathbf x$, and then the optimal hyperplane will be defined as
$$\mathbf w \cdot \mathbf \phi(\mathbf x)=0.$$

Questions:

  1. Why do many papers use $\mathbf w \cdot \phi(\mathbf x)+b=0$ when they already have the mapping $\phi$, and estimate the parameters $\mathbf w$ and the threshold $b$ separately?

  2. Is there some problem with defining the SVM as
    $$\min_{\mathbf w} ||\mathbf w ||^2$$
    $$s.t. \ y_n \mathbf w \cdot \phi(\mathbf x_n) \geq 1, \forall n$$ and estimating only the parameter vector $\mathbf w$, assuming that we define $\phi_0(\mathbf x)=1, \forall\mathbf x$? (A small numerical sketch of this formulation is given after the questions.)

  3. If the definition of SVM from question 2 is possible, we will have $\mathbf w = \sum_{n} y_n\alpha_n \phi(\mathbf x_n)$, and the threshold will simply be $b=w_0$, which we will not treat separately. So we will never use a formula like $b=t_n-\mathbf w\cdot \phi(\mathbf x_n)$ to estimate $b$ from some support vector $\mathbf x_n$. Right?
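For concreteness, here is a minimal sketch of the formulation in question 2, assuming the `cvxpy` solver and a made-up, linearly separable toy dataset. The constant feature $\phi_0(\mathbf x)=1$ is appended explicitly, and the threshold is read off as $b = w_0$:

```python
import numpy as np
import cvxpy as cp

# Hypothetical toy data: two linearly separable clusters, labels in {-1, +1}.
X = np.array([[2.0, 2.0], [3.0, 3.0], [2.5, 3.5],
              [-1.0, -1.0], [-2.0, -0.5], [-1.5, -2.0]])
y = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])

# "Mapping" phi: prepend the constant feature phi_0(x) = 1.
Phi = np.hstack([np.ones((X.shape[0], 1)), X])

# Formulation from question 2:  min ||w||^2  s.t.  y_n (w . phi(x_n)) >= 1.
w = cp.Variable(Phi.shape[1])
problem = cp.Problem(cp.Minimize(cp.sum_squares(w)),
                     [cp.multiply(y, Phi @ w) >= 1])
problem.solve()

w_hat = w.value
b_hat = w_hat[0]  # the threshold is just the weight of the constant feature
print("w =", w_hat[1:], " b =", b_hat)
```

Note that in this formulation $w_0$ sits inside $||\mathbf w||^2$, so the threshold is regularized along with the weights; the answer below explains why that is usually undesirable.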

Best Answer

Why is the bias important?

The bias term $b$ is, indeed, a special parameter in SVM. Without it, the classifier always passes through the origin. So the SVM cannot give you the separating hyperplane with the maximum margin unless that hyperplane happens to pass through the origin.

Below is a visualization of the bias issue. An SVM trained with (without) a bias term is shown on the left (right). Although both SVMs are trained on the same data, they look very different.

[Figure: SVM trained with a bias term (left) vs. without a bias term (right)]
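The effect is easy to reproduce numerically. Below is a rough sketch, assuming scikit-learn and made-up blob data shifted away from the origin; `fit_intercept=False` forces the decision boundary through the origin:

```python
import numpy as np
from sklearn.svm import LinearSVC

# Made-up data: two blobs, both far from the origin, separable by a shifted line.
rng = np.random.RandomState(0)
X = np.vstack([rng.randn(50, 2) + [4, 4],    # class +1 around (4, 4)
               rng.randn(50, 2) + [8, 8]])   # class -1 around (8, 8)
y = np.r_[np.ones(50), -np.ones(50)]

with_bias = LinearSVC(C=1.0, fit_intercept=True, max_iter=10000).fit(X, y)
no_bias = LinearSVC(C=1.0, fit_intercept=False, max_iter=10000).fit(X, y)

# Without a bias the boundary w.x = 0 must pass through the origin, so it
# cannot sit between the two clusters and accuracy collapses.
print("with bias   :", with_bias.score(X, y))   # close to 1.0
print("without bias:", no_bias.score(X, y))     # close to 0.5
```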

Why should the bias be treated separately?

As user logistic pointed out, the bias term $b$ should be treated separately because of regularization. SVM maximizes the margin size, which is $\frac{1}{||\mathbf w||}$ (or $\frac{2}{||\mathbf w||}$, depending on how you define it).

Maximizing the margin is the same as minimizing $||\mathbf w||^2$. This term is also called the regularization term and can be interpreted as a measure of the complexity of the classifier. However, you do not want to regularize the bias term, because the bias simply shifts the classification scores up or down by the same amount for all data points. In particular, the bias does not change the shape of the classifier or its margin size. Therefore, ...

the bias term in SVM should NOT be regularized.
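Concretely, the standard soft-margin primal keeps $b$ out of the regularization term. Below is a minimal sketch (assuming cvxpy; `C`, the function name, and the data handling are illustrative placeholders):

```python
import cvxpy as cp

def svm_primal(X, y, C=1.0, regularize_bias=False):
    """Soft-margin linear SVM primal; the bias b is penalized only on request."""
    n, d = X.shape
    w, b = cp.Variable(d), cp.Variable()
    xi = cp.Variable(n, nonneg=True)                       # slack variables
    reg = cp.sum_squares(w) + (cp.square(b) if regularize_bias else 0)
    objective = cp.Minimize(0.5 * reg + C * cp.sum(xi))
    constraints = [cp.multiply(y, X @ w + b) >= 1 - xi]
    cp.Problem(objective, constraints).solve()
    return w.value, b.value

# e.g. with the toy data from the earlier sketch:
# w, b   = svm_primal(X, y, C=1.0)                        # b is free, not penalized
# w_r, b_r = svm_primal(X, y, C=1.0, regularize_bias=True)
```

Calling it with `regularize_bias=True` adds $b^2$ to the objective, which pulls the hyperplane toward the origin; that is precisely the side-effect the standard formulation avoids.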

In practice, however, it is easier to just push the bias into the feature vector instead of having to deal with it as a special case.

Note: when pushing the bias into the feature map, it is best to fix that dimension of the feature vector to a large number, e.g. $\phi_0(\mathbf x) = 10$, so as to minimize the side-effects of regularizing the bias.
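This is essentially how scikit-learn's `LinearSVC` handles the intercept: a synthetic constant feature with value `intercept_scaling` is appended to every sample, its weight is regularized like any other, and increasing `intercept_scaling` lessens the effect of regularization on the intercept. A small sketch with made-up data:

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.RandomState(0)
X = np.vstack([rng.randn(50, 2) + [4, 4], rng.randn(50, 2) + [8, 8]])
y = np.r_[np.ones(50), -np.ones(50)]

# LinearSVC folds the bias into the feature vector as a constant column whose
# value is intercept_scaling; that synthetic weight IS regularized, so raising
# intercept_scaling reduces how strongly the effective intercept is shrunk.
for scale in (1.0, 10.0, 100.0):
    clf = LinearSVC(C=1.0, intercept_scaling=scale, max_iter=10000).fit(X, y)
    print(f"intercept_scaling={scale:6.1f}   intercept_={clf.intercept_[0]:.3f}")
```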
