RKHS and the Kernel Trick seen in Machine Learning

functional-analysis, machine-learning, reproducing-kernel-hilbert-spaces

So I have been reading about Reproducing Kernel Hilbert Spaces (RKHS), and I am confused about how they are used in connection with the kernel trick seen in machine learning.

For example, in this blog post (Section 4), they give an example in which they define the map $\phi:\mathcal{X} \rightarrow \mathcal{H}$ by $\phi(x) = K_x$, where $K_x$ is the reproducing kernel associated with the element $x \in \mathcal{X}$ (i.e. $K_x = K(\cdot, x)$).

And then we have the kernel function:

$$K(x,z) = \langle \phi(x), \phi(z)\rangle_{\mathcal{H}}$$

OK, so far I understand this: the "feature map" is a map from the set $\mathcal{X}$ to the Hilbert space of functions $\mathcal{H}$.

However, in the example they provide further down in Section 4, they define $\phi(x)$ as a vector in $\mathbb{R}^{10}$. This confuses me, as I thought that $\phi(x)$ should be an element of the Hilbert space of functions $\mathcal{H}$, not a vector.
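For concreteness, my guess (I may be misreading the example) is that this $\mathbb{R}^{10}$ vector is an explicit monomial expansion, something along the lines of the degree-2 polynomial kernel on $\mathbb{R}^3$:

$$\phi(x) = \bigl(1,\ \sqrt{2}\,x_1,\ \sqrt{2}\,x_2,\ \sqrt{2}\,x_3,\ x_1^2,\ x_2^2,\ x_3^2,\ \sqrt{2}\,x_1x_2,\ \sqrt{2}\,x_1x_3,\ \sqrt{2}\,x_2x_3\bigr) \in \mathbb{R}^{10},$$

which satisfies $\langle\phi(x),\phi(z)\rangle_{\mathbb{R}^{10}} = (1 + x^\top z)^2$.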

Is it because the vector represents a function as some linear combination of functions?

Best Answer

That blog post uses (imo) ambiguous notation and terminology, so I totally understand your confusion.

The short answer is that the definition given in that blog post is wrong: for a given kernel $K$, a feature map $\varphi$ is a map from $\mathcal X$ to any Hilbert space $F$ (called a feature space) such that the following relation holds: $$K(x,y)=\langle\varphi(x),\varphi(y) \rangle_F \ \ \text{for all } x,y\in \mathcal X. \tag1 $$ In particular, observe that $F$ need not be equal to $\mathcal H$. (In fact, there always exist infinitely many feature maps and feature spaces for a given kernel $K$.)
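To make $(1)$ concrete, here is a small numerical check (my own sketch, not taken from the blog post): the polynomial kernel $K(x,y)=(x^\top y)^2$ on $\mathcal X=\mathbb R^2$ admits a feature map into $F=\mathbb R^4$ and a different one into $F=\mathbb R^3$, and both satisfy relation $(1)$.

```python
import numpy as np

def kernel(x, y):
    # K(x, y) = (<x, y>)^2, a valid (polynomial) kernel on R^2
    return np.dot(x, y) ** 2

def phi_4d(x):
    # feature map into F = R^4: all ordered degree-2 monomials
    return np.array([x[0]*x[0], x[0]*x[1], x[1]*x[0], x[1]*x[1]])

def phi_3d(x):
    # a different feature map, into F = R^3, for the *same* kernel
    return np.array([x[0]**2, np.sqrt(2)*x[0]*x[1], x[1]**2])

rng = np.random.default_rng(0)
x, y = rng.normal(size=2), rng.normal(size=2)

print(kernel(x, y))              # direct kernel evaluation
print(phi_4d(x) @ phi_4d(y))     # same value via the R^4 feature map
print(phi_3d(x) @ phi_3d(y))     # same value via the R^3 feature map
```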

When we use the kernel trick to solve problems in practice, we don't really care about the feature map itself: the whole point of kernel methods is that we can perform computations directly with kernel evaluations on $\mathcal X$ while implicitly using the higher-dimensional representations in $F$.
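As an illustration (a sketch with made-up data, assuming the same polynomial kernel as above; this is my example, not one from the post), kernel ridge regression only ever needs the Gram matrix of pairwise kernel values, so $\varphi$ never appears in the code:

```python
import numpy as np

def kernel(x, y):
    # K(x, y) = (<x, y>)^2, the same polynomial kernel as above
    return (x @ y) ** 2

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 2))          # 20 training points in R^2
y = X[:, 0] * X[:, 1]                 # a target that depends nonlinearly on the inputs
lam = 1e-3                            # ridge regularisation strength

# Gram matrix of pairwise kernel evaluations -- the only object we ever build
G = np.array([[kernel(xi, xj) for xj in X] for xi in X])
alpha = np.linalg.solve(G + lam * np.eye(len(X)), y)    # dual coefficients

def predict(x_new):
    # the prediction is a kernel expansion over the training points
    return sum(a * kernel(x_new, xi) for a, xi in zip(alpha, X))

print(predict(np.array([0.5, -1.0])))   # should be close to 0.5 * (-1.0) = -0.5
```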

Feature maps are nevertheless useful, as they tell us the dimensionality of the space $F$ in which the computations are implicitly performed (and we know that more dimensions means more representational power).
