[Math] Hyperplanes and Support Vector Machines

Tags: geometry, machine learning

I have the following question regarding support vector machines: So we are given a set of
training points $\{x_i\}$ and a set of binary labels $\{y_i\}$.

Now usually the hyperplane classifying the points is defined as:

$ w \cdot x + b = 0 $

First question: Here $x$ does not denote the points of the training set, but the points on the separating (hyper)plane, right?

In a next step, the lecture notes state that the function:

$f(x) = \mathrm{sign}\, (w \cdot x + b)$

correctly classifies the training data.
Second question: Now I don't understand that, since it was stated earlier that $ w \cdot x + b = 0 $, so how can something which is defined to be zero have a sign?

Two additional questions:

(1) You have added that a slack variable might be introduced for non-linearly separable data. How do we know that the data are not linearly separable? As far as I understand, the purpose of the kernel mapping is to map the data into a vector space where, in the optimal case, they would be linearly separable. (And why not use a non-linear discriminant function altogether instead of introducing a slack variable?)

(2) I've seen that for the optimization only one of the inequalities $w \cdot x + b \geqslant 1$ is being used as a linear constraint – why?

Best Answer

I suggest that you thoroughly study the really insightful tutorial on Support Vector Machines for Pattern Recognition by C. Burges. It isn't elementary, but it gives significant information about SV learning. It really helped me as a beginner!

Concerning your first question: $\mathcal{H}\colon\mathbf{w}\cdot\mathbf{x}+b=0$ describes the locus of points belonging to the hyperplane $\mathcal{H}$. $\mathbf{x}$ denotes an arbitrary vector in your input space, that is, the space where your training samples (i.e. $\mathbf{x}_i$, $i=1,\ldots,l$) live. Often this is some subset of $\mathbb{R}^n$.

As for your second question: granted that you have learned some separating hyperplane $\mathcal{H}\colon\mathbf{w}\cdot\mathbf{x}+b=0$ (and this is exactly what a linear SVM does), then given some unseen datum $\mathbf{x}_t$ (that is, a new one without a truth label), you have to classify it to some class by giving it a truth label $y_t\in\{\pm1\}$. The way you do so is by checking the sign of the decision function $f(\mathbf{x})=\mathbf{w}\cdot\mathbf{x}+b$, that is, $y_t=\operatorname{sgn}(f(\mathbf{x}_t))$. $\operatorname{sgn}(f(\mathbf{x}_t))>0$ means that your datum lies "above" the separating hyperplane and thus has to be classified as positive, while $\operatorname{sgn}(f(\mathbf{x}_t))<0$ means that it lies "below" the separating hyperplane and thus has to be classified as negative. If $\operatorname{sgn}(f(\mathbf{x}_t))=0$, your testing sample belongs to the separating hyperplane itself, and no decision can be made about its truth label.

Note that if $-1<f(\mathbf{x}_t)<+1$ then the testing datum lies inside the so-called margin of the classifier (take a look at the tutorial above for more information).
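
To make the sign-based decision concrete, here is a minimal NumPy sketch; the weight vector and bias below are made-up values standing in for the result of training, not parameters from any actual SVM fit.

```python
import numpy as np

# Made-up parameters standing in for a learned hyperplane H: w . x + b = 0
w = np.array([2.0, -1.0])   # normal vector of the hyperplane
b = -0.5                    # offset (bias)

def f(x):
    """Decision value f(x) = w . x + b."""
    return float(np.dot(w, x) + b)

def classify(x):
    """Predicted label sgn(f(x)); 0 only if x lies exactly on the hyperplane."""
    return int(np.sign(f(x)))

x_t = np.array([1.0, 0.5])   # an unseen test point
print(f(x_t))                # 1.0 -> positive side of the hyperplane
print(classify(x_t))         # +1
# A value with |f(x_t)| < 1 would additionally mean x_t falls inside the margin.
```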

EDIT

I would like to add some comments below motivated by the discussion between @Pegah and me in the comments.

I guess because in other contexts the vector $\mathbf{x}$ in the equation for a hyperplane refers to the points on the hyperplane, whereas here $\mathbf{x}$ refers to input vectors which are not necessarily on it, which seemed ambiguous to me.

This is not correct, actually. $\mathbf{x}\in\mathbb{R}^n$ refers to any point in the Euclidean $n$-dimensional space. Now, if and only if such a point belongs to the hyperplane $\mathcal{H}$ does it satisfy the equation $\mathbf{w}\cdot\mathbf{x}+b=0$. In other words, the hyperplane $\mathcal{H}$ is the locus of the $n$-dimensional points that satisfy $\mathbf{w}\cdot\mathbf{x}+b=0$. For all other points it must hold that $\mathbf{w}\cdot\mathbf{x}+b>0$ or $\mathbf{w}\cdot\mathbf{x}+b<0$.
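
A small numerical example (with made-up numbers) may help: in $\mathbb{R}^2$ take $\mathbf{w}=(1,1)$ and $b=-1$. The point $(0.5,0.5)$ gives $\mathbf{w}\cdot\mathbf{x}+b=0.5+0.5-1=0$ and therefore lies on $\mathcal{H}$, whereas $(2,2)$ gives $2+2-1=3>0$ and $(0,0)$ gives $-1<0$; the latter two do not belong to the hyperplane but lie on its positive and negative side, respectively.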

I have another question, though, if you don't mind: I do understand that we can label new input vectors with $-1$ and $+1$, but why (as in the script linked by you) do we assume that $\mathbf{w}\cdot\mathbf{x}+b\leq-1$ and $\mathbf{w}\cdot\mathbf{x}+b\geq+1$, respectively? I don't see that explained anywhere, and I am wondering why it could not be $a$ and $-a$ for some $a\in\mathbb{R}$. Is it simply scaling (and are hence both sides of the inequality scaled, i.e. divided by $a$)?

The above inequalities are our constraints stating that the training data are linearly separable, which, of course, is a very strict and coarse assumption. This leads to the so-called hard-margin SVM, whose solution is a separating hyperplane of the form $\mathcal{H}$. Keep in mind that you will never obtain such a hyperplane if your training data are not linearly separable, since the objective function of the hard-margin SVM (i.e. the dual Lagrangian) will grow arbitrarily large. Then you need to relax these constraints by introducing the slack variables $\xi_i$, which permit the training data to be non-linearly separable. The reason the $+1$'s and $-1$'s appear here is that there is indeed a normalization by the norm of $\mathbf{w}$.
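
To spell out the scaling argument: if the data are linearly separable, there exists some $a>0$ such that $y_i(\mathbf{w}\cdot\mathbf{x}_i+b)\geq a$ for all $i$. Dividing both $\mathbf{w}$ and $b$ by $a$ describes exactly the same hyperplane (the equation $\mathbf{w}\cdot\mathbf{x}+b=0$ is unaffected by a positive rescaling), but the constraints become $y_i\big((\mathbf{w}/a)\cdot\mathbf{x}_i+b/a\big)\geq1$. So the $\pm1$ is merely a convention that fixes the scale of $(\mathbf{w},b)$; with this canonical scaling the margin width works out to $2/\lVert\mathbf{w}\rVert$, which is what the SVM then maximizes.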


Concerning your two additional questions:

(1) You have added that a slack variable might be introduced for non-linearly separable data. How do we know that the data are not linearly separable? As far as I understand, the purpose of the kernel mapping is to map the data into a vector space where, in the optimal case, they would be linearly separable. (And why not use a non-linear discriminant function altogether instead of introducing a slack variable?)

First of all, you do not know a priori that your data are linearly separable (in your input space). It is reasonable, though, to want to solve the problem in both cases. We introduce the slack variables so that a training sample may fall into the wrong class, contributing this way to the total loss, and it is this total loss that we want to minimize. To be more specific, let $\xi_i\geq0$ be the aforementioned slack variables. When $0<\xi_i<1$, the sample lies inside the margin but is still correctly classified. If $\xi_i=1$, the sample lies exactly on the separating hyperplane. If $\xi_i>1$, the sample falls on the wrong side and is misclassified. In all these cases (i.e., whenever $\xi_i>0$) there is an error which must be measured (and minimized!). We measure the total error by summing up all the slacks, that is, $C\sum_{i}\xi_i$, where $C$ is a user-specified positive parameter (a larger $C$ corresponds to assigning a higher penalty to errors).
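
A short sketch of how these slack values would be computed for a fixed hyperplane (the parameters and samples below are made up purely for illustration): each $\xi_i$ equals $\max(0,\,1-y_i(\mathbf{w}\cdot\mathbf{x}_i+b))$, and the penalty term added to the objective is $C\sum_i\xi_i$.

```python
import numpy as np

# Made-up hyperplane parameters and labelled samples, purely for illustration.
w = np.array([1.0, 1.0])
b = -1.0
C = 10.0                                  # user-specified penalty parameter

X = np.array([[2.0, 2.0],                 # comfortably on the correct side
              [0.8, 0.5],                 # inside the margin, still correct
              [0.2, 0.1]])                # on the wrong side (misclassified)
y = np.array([+1, +1, +1])

# Slack of each sample: xi_i = max(0, 1 - y_i * (w . x_i + b))
xi = np.maximum(0.0, 1.0 - y * (X @ w + b))
print(xi)                  # [0.   0.7  1.7]
print(C * xi.sum())        # 24.0 -> total penalty C * sum_i xi_i
```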

Now, what about the kernels? We use a kernel because we know that by going to the feature space (i.e., the space the kernel maps our input space into) it is very likely that our data become linearly separable. Then we employ our linear method to classify them (linearly!). What is linear in the feature space is not linear in the original input space (provided that the selected kernel is not the linear kernel $k(\mathbf{x},\mathbf{x}')=\mathbf{x}\cdot\mathbf{x}'$). When we go back to the input space, the decision function we have found in the feature space is no longer linear.
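
As an illustration of this point, here is a small sketch using scikit-learn (the toy data and parameter values are arbitrary choices of mine): the XOR pattern is not linearly separable in the input space, so a linear SVM cannot classify all four points correctly, whereas an SVM with an RBF kernel, i.e. a classifier that is linear only in the kernel's feature space, can.

```python
import numpy as np
from sklearn.svm import SVC

# XOR-style toy data: not linearly separable in the original input space.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([-1, +1, +1, -1])

linear_svm = SVC(kernel='linear', C=1.0).fit(X, y)
rbf_svm = SVC(kernel='rbf', gamma=2.0, C=1.0).fit(X, y)

# No linear boundary gets all four points right, so this stays below 1.0 ...
print(linear_svm.score(X, y))
# ... while in the RBF feature space the data become separable (accuracy 1.0 here).
print(rbf_svm.score(X, y))
```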

(2) I've seen that for the optimization only one of the inequalities $w \cdot x + b \geqslant 1$ is being used as a linear constraint - why?

That cannot be right: the constraints are still $y_i(\mathbf{w}\cdot\mathbf{x}_i+b)+\xi_i\geq1$ for all $i$, i.e. $y_i(\mathbf{w}\cdot\mathbf{x}_i+b)\geq1-\xi_i$.
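
Note also why a single inequality suffices: for a positive sample ($y_i=+1$) the constraint reads $\mathbf{w}\cdot\mathbf{x}_i+b\geq1-\xi_i$, while for a negative sample ($y_i=-1$) it reads $\mathbf{w}\cdot\mathbf{x}_i+b\leq-(1-\xi_i)$; multiplying the latter by $y_i=-1$ flips the inequality, so both cases collapse into the single form $y_i(\mathbf{w}\cdot\mathbf{x}_i+b)\geq1-\xi_i$. The label $y_i$ is what encodes which side each sample must lie on.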
