[Math] How to compute the offset parameter for an SVM from the dual solution

machine learning

If the separating hyperplane for an SVM is

$$\theta \cdot x^{(i)} + \theta_0 = 0$$

How do you compute $\theta_0$ from the dual solution?

What I have so far is: for every support vector (SV) $x^{(t)}$ we have

$$y^{(t)} (\theta \cdot x^{(t)} + \theta_0) = y^{(t)}\left(\sum^n_{t'=1}\alpha_{t'} y^{(t')} x^{(t')} \cdot x^{(t)} + \theta_0\right) = 1$$

Multiply both sides by $y^{(t)}$ for the given SV; since $y^{(t)} \in \{-1, +1\}$, we have $(y^{(t)})^2 = 1$, so

$$\theta \cdot x^{(t)} + \theta_0 = \sum^n_{t'=1}\alpha_{t'} y^{(t')} x^{(t')} \cdot x^{(t)} + \theta_0 = y^{(t)}$$

Now we solve for $\theta_0$:

$$\theta_0 = y^{(t)} - \sum^n_{t'=1}\alpha_{t'} y^{(t')} x^{(t')} \cdot x^{(t)}$$

But what I found a little strange is that any support vector will do. Each $y^{(t)}$ differs across SVs, while $\theta = \sum^n_{t'=1}\alpha_{t'} y^{(t')} x^{(t')}$ is the same for every SV. How come different values of $y^{(t)}$ and $x^{(t)}$ still give the same $\theta_0$? Is my formulation even right? It might be, it just seemed a little odd/counter-intuitive.
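For what it's worth, the claim is easy to check numerically. Below is a minimal sketch (assuming scikit-learn and NumPy; the toy data and the large $C$ are my own choices, not part of the question) that fits a linear SVM on separable data, so every support vector sits exactly on the margin, and then computes $\theta_0$ from each support vector individually:

```python
# Sketch: check that every (margin) support vector yields the same theta_0.
# A large C on linearly separable data approximates a hard-margin SVM.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=[2, 2], size=(20, 2)),
               rng.normal(loc=[-2, -2], size=(20, 2))])
y = np.array([1] * 20 + [-1] * 20)

clf = SVC(kernel="linear", C=1e6).fit(X, y)

# dual_coef_ stores alpha_t * y^(t) for the support vectors, so
# theta = sum_{t'} alpha_{t'} y^{(t')} x^{(t')} is a single matrix product:
theta = (clf.dual_coef_ @ clf.support_vectors_).ravel()

# theta_0 from *each* support vector: y^{(t)} - theta . x^{(t)}
theta0_per_sv = y[clf.support_] - clf.support_vectors_ @ theta

print(theta0_per_sv)   # all entries agree up to numerical error
print(clf.intercept_)  # sklearn's stored offset matches
```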

Best Answer

Your formulation is right for the linearly separable case. When no hyperplane can separate the training data perfectly, slack variables are required.

Indeed we can get the offset from any of the support vectors, based on complementary slackness: every support vector lies exactly on the margin, so $y^{(t)}(\theta \cdot x^{(t)} + \theta_0) = 1$ holds for each of them, and each one therefore pins down the same $\theta_0$. In practice, we usually average over all support vectors for better numerical stability, e.g. $$ \theta_0 = \frac{1}{N_{\mathcal{S}}}\sum_{t \in \mathcal{S}} \left( y^{(t)} - \sum_{t'=1}^{n}\alpha_{t'} y^{(t')}{x^{(t')}}^{\intercal}x^{(t)} \right) $$ where $\mathcal{S}$ is the set of support-vector indices and $N_{\mathcal{S}} = |\mathcal{S}|$.
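A minimal NumPy sketch of this averaging (the names `offset_from_dual`, `alpha`, `y`, `X`, and the tolerance `tol` are illustrative, not from any library):

```python
import numpy as np

def offset_from_dual(alpha, y, X, tol=1e-8):
    """Averaged theta_0 from a dual solution.

    alpha : (n,) dual variables; y : (n,) labels in {-1, +1}; X : (n, d) inputs.
    """
    sv = alpha > tol                 # support vectors: alpha_t > 0
    theta = (alpha * y) @ X          # theta = sum_t' alpha_t' y^(t') x^(t')
    # Average y^(t) - theta . x^(t) over all support vectors:
    return np.mean(y[sv] - X[sv] @ theta)
```

With a kernel, the inner products would be replaced by kernel evaluations $K(x^{(t')}, x^{(t)})$, since $\theta$ is never formed explicitly in that setting.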