Understanding the existence of a kernel function that transforms non-linearly separable samples into separable samples in general

machine learning

I'm studying section $5.11$ on support vector machines from Duda and Hart's Pattern Classification. The authors write:

With an appropriate nonlinear mapping $\phi$ to a sufficiently high
dimension, data from two categories can always be separated by a
hyperplane (Problem 27).

Suppose we have $n$ points of dimension $d$ with $n_1, n_2$ points in classes $1,2$ respectively.
There seem to be many posts that show visualizations of kernels transforming non-linearly separable data into separable data, but very little information on how these kernel functions are chosen for examples that are not hand-crafted, and little in the way of proof of linear separability in the transformed space. An example of such a post: https://stackoverflow.com/questions/3997454/help-me-understand-linear-separability-in-a-binary-svm.

I'm looking for a hint as to why such a mapping always exists, assuming no restrictions on the dimension of the transformed space.

Best Answer

In the exercise it is not required that the non-linear map $\phi$ be induced by a kernel function. If $x_1,\ldots ,x_{n_1}\in\mathbb{R}^d$ are the samples of class $1$ and $y_1,\ldots ,y_{n_2}\in\mathbb{R}^d$ are those of class $2$, one can map each sample $x_i$ to $(x_i,0)\in\mathbb{R}^{d+1}$ and each $y_i$ to $(y_i,1)$, and extend the resulting map to the whole of $\mathbb{R}^d$ in any way one likes. (This presupposes that no point belongs to both classes, which is necessary for separability in the first place.) The hyperplane $x_{d+1}=\frac{1}{2}$ then separates the two classes.
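A minimal sketch in Python of this construction, with made-up sample data and dimensions (not from the book), that appends the class-indicator coordinate and checks that the hyperplane $x_{d+1}=\frac{1}{2}$ separates the images:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical samples: two classes in R^d that need not be linearly separable.
d, n1, n2 = 2, 50, 60
X1 = rng.normal(size=(n1, d))          # class 1 samples x_1, ..., x_{n1}
X2 = rng.normal(size=(n2, d))          # class 2 samples y_1, ..., y_{n2}

def phi(points, label):
    """Map each sample into R^{d+1} by appending 0 for class 1 and 1 for class 2."""
    extra = np.full((len(points), 1), float(label))
    return np.hstack([points, extra])

Z1 = phi(X1, 0)   # images (x_i, 0)
Z2 = phi(X2, 1)   # images (y_i, 1)

# The hyperplane x_{d+1} = 1/2, i.e. normal w = e_{d+1} and offset b = -1/2,
# separates the two image sets by construction.
w = np.zeros(d + 1)
w[-1] = 1.0
b = -0.5
print(np.all(Z1 @ w + b < 0))   # True: class 1 lies strictly on one side
print(np.all(Z2 @ w + b > 0))   # True: class 2 lies strictly on the other side
```

The check succeeds regardless of how the original points are arranged in $\mathbb{R}^d$, which is exactly the point of the exercise: with no constraint on the target dimension, separability can be forced by the embedding itself.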
