I suggest that you thoroughly study the really insightful tutorial "A Tutorial on Support Vector Machines for Pattern Recognition" by C. Burges. It isn't elementary, but it covers SV learning in real depth. It really helped me as a beginner!
Concerning your first question: $\mathcal{H}\colon\mathbf{w}\cdot\mathbf{x}+b=0$ describes the locus of points belonging to the hyperplane $\mathcal{H}$. $\mathbf{x}$ denotes an arbitrary vector in your input space, that is, the space where your training samples (i.e. $\mathbf{x}_i$, $i=1,\ldots,l$) live. Often this is some subset of $\mathbb{R}^n$.
As for your second question: granted that you have learned some separating hyperplane $\mathcal{H}\colon\mathbf{w}\cdot\mathbf{x}+b=0$ (and this is exactly what a linear SVM does), then, given some unseen datum $\mathbf{x}_t$ (that is, a new one, without a truth label), you have to classify it to some class by giving it a label $y_t\in\{\pm1\}$. You do so by checking the sign of the decision function $f(\mathbf{x})=\mathbf{w}\cdot\mathbf{x}+b$, that is, $y_t=\operatorname{sgn}(f(\mathbf{x}_t))$. $f(\mathbf{x}_t)>0$ means that your datum lies "above" the separating hyperplane and thus has to be classified as positive, while $f(\mathbf{x}_t)<0$ means that it lies "under" the separating hyperplane and thus has to be classified as negative. If $f(\mathbf{x}_t)=0$, your testing sample belongs to the separating hyperplane itself, and so no decision can be made about its label.
Note that if $-1<f(\mathbf{x}_t)<+1$ then the testing datum lies inside the so-called margin of the classifier (take a look at the tutorial above for more information).
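For illustration, here is a minimal sketch of the classification rule above (the values of $\mathbf{w}$, $b$ and the test point are made up; in practice $\mathbf{w}$ and $b$ are the output of SVM training):

```python
import numpy as np

# Made-up parameters of a learned separating hyperplane
# (in practice w and b come out of SVM training).
w = np.array([2.0, -1.0])
b = 0.5

def decision_function(x):
    """f(x) = w . x + b"""
    return np.dot(w, x) + b

x_t = np.array([1.0, 0.25])       # an unseen test point
f_t = decision_function(x_t)

y_t = np.sign(f_t)                # predicted label in {-1, 0, +1}
inside_margin = -1.0 < f_t < 1.0  # does it lie inside the margin band?

print(f"f(x_t) = {f_t:.3f}, predicted label = {y_t:+.0f}, inside margin: {inside_margin}")
```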
EDIT
I would like to add some comments below motivated by the discussion between @Pegah and me in the comments.
I guess because in other contexts the vectors $\mathbf{x}$ in the
equation for hyperplanes refer to the points on the hyperplane, but
here $\mathbf{x}$ refers to input vectors which are not necessarily on
it, which seemed ambiguous to me.
This is not correct, actually. $\mathbf{x}\in\mathbb{R}^n$ refers to any point in the Euclidean $n$-dimensional space. Now, if and only if such a point belongs to the hyperplane $\mathcal{H}$, it satisfies the equation $\mathbf{w}\cdot\mathbf{x}+b=0$. In other words, the hyperplane $\mathcal{H}$ is the locus of the $n$-dimensional points that satisfy $\mathbf{w}\cdot\mathbf{x}+b=0$. For all other points it must hold that $\mathbf{w}\cdot\mathbf{x}+b>0$ or $\mathbf{w}\cdot\mathbf{x}+b<0$.
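To make the locus statement concrete, here is a minimal sketch (with a made-up $\mathbf{w}$ and $b$, chosen purely for illustration) that evaluates $\mathbf{w}\cdot\mathbf{x}+b$ for points on, above and below a toy hyperplane in $\mathbb{R}^2$:

```python
import numpy as np

# Toy hyperplane H: w . x + b = 0 with w = (1, 1), b = -1,
# i.e. the line x1 + x2 = 1 in R^2 (values chosen only for illustration).
w = np.array([1.0, 1.0])
b = -1.0

points = {
    "on H":    np.array([0.5, 0.5]),   # 0.5 + 0.5 - 1 =  0
    "above H": np.array([2.0, 1.0]),   # 2.0 + 1.0 - 1 = +2
    "below H": np.array([0.0, 0.0]),   # 0.0 + 0.0 - 1 = -1
}

for name, x in points.items():
    print(f"{name:8s}: w.x + b = {np.dot(w, x) + b:+.1f}")
```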
I have another question though if you don't mind: I do understand that
we can label new input vectors with $-1$ and $+1$, but why (as in the
script linked by you) do we assume that
$\mathbf{w}\cdot\mathbf{x}+b\leq-1$ and
$\mathbf{w}\cdot\mathbf{x}+b\geq+1$ respectively? I don't see that
explained anywhere, and am wondering why it could not be $a$ and $-a$
for some $a\in\mathbb{R}$. Is it simply scaling (and are hence both
sides of the inequality scaled, i.e. divided by $a$?).
The above inequalities are the constraints stating that the training data are linearly separable, which, of course, is a very strict and coarse assumption. They lead to the so-called hard-margin SVM, whose solution is a separating hyperplane of the form $\mathcal{H}$. Keep in mind that you will never obtain such a hyperplane if your training data are not linearly separable, since the objective function of the hard-margin SVM (i.e. the dual Lagrangian) grows arbitrarily large. In that case, you need to relax these constraints by introducing the slack variables $\xi_i$, which allow the training data to be non-linearly separable. As for why the bounds are $+1$ and $-1$: indeed, this is just a normalization; any $a$ and $-a$ can be rescaled to $+1$ and $-1$ by dividing $\mathbf{w}$ and $b$ by $a$, after which the margin is measured relative to $\lVert\mathbf{w}\rVert$.
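To spell out the scaling argument behind the $\pm1$ (this is the standard canonicalization step, added here for completeness): suppose the data are separable with some $a>0$, i.e.
$$\mathbf{w}\cdot\mathbf{x}_i+b\geq +a \quad\text{for } y_i=+1, \qquad \mathbf{w}\cdot\mathbf{x}_i+b\leq -a \quad\text{for } y_i=-1.$$
Dividing both $\mathbf{w}$ and $b$ by $a$ describes exactly the same hyperplane (the zero set of $\mathbf{w}\cdot\mathbf{x}+b$ does not change under rescaling) and turns the constraints into $y_i(\mathbf{w}\cdot\mathbf{x}_i+b)\geq1$. So the $+1$ and $-1$ are not an extra assumption: the value $a$ is simply absorbed into the scale of $\mathbf{w}$, and the geometric margin then equals $1/\lVert\mathbf{w}\rVert$.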
Concerning your two additional questions:
(1) You have added that a slack variable might be introduced for
non-linearly separable data - how do we know that the data is not
linearly separable, as far as I understand the mapping via kernel has
as a purpose to map the data into a vector space where in the optimal
case it would be linearly separable (and why not using a non-linear
discriminant function altogether instead of introducing a slack
variable?)
First of all, you do not know a priori whether your data are linearly separable (in your input space). It is reasonable, though, to want a method that handles both cases. We introduce the slack variables so that a training sample may end up on the wrong side, contributing in that case to the total loss, and what we want is to minimize this total loss. To be more specific, let $\xi_i\geq0$ be the aforementioned slack variables. When $\xi_i=0$, the sample is correctly classified and lies on or outside the margin. When $0<\xi_i<1$, the sample lies inside the margin, but it is not yet misclassified. If $\xi_i=1$, the sample lies exactly on the separating hyperplane (still not misclassified). However, if $\xi_i>1$, the sample falls on the wrong side, i.e., it is misclassified. In all cases with $\xi_i>0$ there is an error which must be measured (and minimized!). We measure the total error by summing up all the slacks, that is, $C\sum_{i}\xi_i$, where $C$ is a user-specified positive parameter (a larger $C$ corresponds to assigning a higher penalty to errors).
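As a small hedged illustration of the slack variables (the synthetic overlapping data and scikit-learn's SVC are assumed here purely for convenience; they are not part of the original discussion), the $\xi_i$ of a trained soft-margin SVM can be recovered as $\xi_i=\max(0,\,1-y_i f(\mathbf{x}_i))$:

```python
import numpy as np
from sklearn.svm import SVC

# Tiny synthetic, slightly overlapping two-class problem (assumed data).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(+1.0, 1.0, (50, 2)),
               rng.normal(-1.0, 1.0, (50, 2))])
y = np.hstack([np.ones(50), -np.ones(50)])

# Soft-margin linear SVM; C weighs the total slack penalty C * sum(xi_i).
clf = SVC(kernel="linear", C=1.0).fit(X, y)

f = clf.decision_function(X)        # f(x_i) = w . x_i + b
xi = np.maximum(0.0, 1.0 - y * f)   # slack: (0,1) -> inside margin, >1 -> misclassified

print(f"misclassified samples: {int(np.sum(xi > 1))}")
print(f"total slack sum(xi):   {xi.sum():.3f}")
```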
Now, what about the kernels? We use a kernel because we know that, by going to the feature space (i.e., the space into which the feature map associated with the kernel sends our input space), it is very likely that our data become linearly separable. Then we employ our linear method to classify them (linearly!). What is linear in the feature space is not linear in the original input space (granted that the selected kernel is not the linear kernel $k(\mathbf{x},\mathbf{x}')=\mathbf{x}\cdot\mathbf{x}'$). When we go back to the input space, the decision function we have found in the feature space is no longer linear.
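A short sketch of this point (the concentric-circles toy data from scikit-learn is assumed here just as a convenient non-linearly-separable example): a linear SVM fails in the input space, while an RBF-kernel SVM, which is still linear in its feature space, separates the same data, and its decision boundary pulled back to the input space is non-linear.

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Two concentric rings: not linearly separable in the 2-D input space.
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

linear_svm = SVC(kernel="linear", C=1.0).fit(X, y)
rbf_svm = SVC(kernel="rbf", C=1.0).fit(X, y)

# The RBF machine is a linear classifier in the kernel's feature space;
# viewed in the input space its decision boundary is a closed curve.
print("linear-kernel training accuracy:", linear_svm.score(X, y))
print("RBF-kernel training accuracy:   ", rbf_svm.score(X, y))
```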
(2) I've seen that for the optimization only one of the inequalities
$\mathbf{w}\cdot\mathbf{x}+b\geq1$ is being used as a linear constraint - why?
That cannot be the case: the constraints are still $y_i(\mathbf{w}\cdot\mathbf{x}_i+b)+\xi_i\geq1$, $\forall i$. The two one-sided inequalities ($\geq+1$ for the positive class and $\leq-1$ for the negative class) are folded into this single constraint by multiplying with the label $y_i\in\{\pm1\}$, which is why only one inequality appears.
In the proof that you have written, only the properties of the inner product and the norm are used, and these hold in any Hilbert space. So, as you have already pointed out, the proof carries over directly to a Hilbert space.
When the data are not linearly separable, a kernel function $K(\cdot, \cdot)$ is used to map the data into a Hilbert space, where linear separability is more likely to hold. The proofs of convergence use only the properties of norms and inner products, and hence are applicable to any Hilbert space.
Note that the Euclidean space $\mathbb{R}^d$ is a finite-dimensional Hilbert space. The principal difference between $\mathbb{R}^d$ and an arbitrary Hilbert space, as far as the validity of proofs of mathematical results is concerned, is compactness: in $\mathbb{R}^d$ every closed and bounded set is compact, but this is not true in an infinite-dimensional Hilbert space. As long as a proof or derivation in $\mathbb{R}^d$ does not use properties of compact sets, it carries over directly to a Hilbert space.
Owing to this, the kernel SVM has already been extended to the case where the observations are elements of an infinite-dimensional Hilbert space, e.g., random functions. For example, see the papers here.
Best Answer
In the exercise it is not required that the non-linear map $\phi$ is determined by a kernel function. If $x_1,\ldots ,x_{n_1}\in\mathbb{R}^d$ are the samples of class $1$ and $y_1,\ldots ,y_{n_2}$ are those of class $2$, one could map the samples $x_i$ to $(x_i,0)\in\mathbb{R}^{d+1}$ and the $y_i$ to $(y_i,1)$, and extend the resulting map somehow to the whole of $\mathbb{R}^d$ if one wishes. The hyperplane $x_{d+1}=\frac{1}{2}$ then separates the two classes.
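A minimal numerical sketch of this construction (the sample points are made up, with $d=2$, so the lifted points live in $\mathbb{R}^3$):

```python
import numpy as np

# Hypothetical samples of class 1 and class 2 in R^2.
x_class1 = np.array([[0.0, 0.0], [1.0, 2.0], [3.0, -1.0]])
y_class2 = np.array([[0.5, 0.5], [-2.0, 1.0]])

# Lift to R^3: class-1 points get last coordinate 0, class-2 points get 1.
phi_x = np.hstack([x_class1, np.zeros((len(x_class1), 1))])
phi_y = np.hstack([y_class2, np.ones((len(y_class2), 1))])

# The hyperplane x_3 = 1/2, i.e. w = (0, 0, 1), b = -1/2, separates the lifted classes.
w, b = np.array([0.0, 0.0, 1.0]), -0.5
print(phi_x @ w + b)   # all equal to -0.5 (strictly below the hyperplane)
print(phi_y @ w + b)   # all equal to +0.5 (strictly above the hyperplane)
```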