Kernel Trick – Understanding Moore-Aronszajn and Mercer Theorems for Kernel Methods


I have been reading about RKHSs and the kernel trick in machine learning, mainly from https://ngilshie.github.io/jekyll/update/2018/02/01/RKHS.html (1) and https://arxiv.org/pdf/2106.08443.pdf (2). In (1), it is stated that because the quadratic kernel is positive semi-definite, the Moore-Aronszajn theorem makes it the reproducing kernel of an RKHS with feature map $\varphi(x) = K_x$, so instead of computing the feature map we can compute the kernel directly. In (2), the author uses Mercer's theorem to say something similar: for a positive semi-definite kernel there exist feature maps, but we do not need to compute these feature maps explicitly to find the inner product.
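To make the "we can compute the kernel directly" claim concrete, here is a minimal numpy sketch for the quadratic kernel $k(x, y) = (x^\top y)^2$ in 2D, where an explicit feature map is small enough to write down (the feature map and the test points are illustrative choices, not from the linked sources):

```python
import numpy as np

def phi(x):
    """Explicit feature map for the quadratic kernel in 2D:
    phi(x) = (x1^2, sqrt(2) x1 x2, x2^2), so phi(x).phi(y) = (x.y)^2."""
    return np.array([x[0]**2, np.sqrt(2.0) * x[0] * x[1], x[1]**2])

def k(x, y):
    """The same kernel computed directly, without any feature map."""
    return np.dot(x, y) ** 2

x = np.array([1.0, 2.0])
y = np.array([3.0, 4.0])

# The kernel trick: both routes give the same inner product.
assert np.isclose(np.dot(phi(x), phi(y)), k(x, y))
```

In higher dimensions the explicit map grows quadratically in size while the right-hand side stays a single dot product and a square, which is the computational point of the trick.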

My question is: how do these two theorems differ in the context of the kernel trick, and to what extent?

Best Answer

Mercer's theorem and the Moore-Aronszajn theorem are often stated in varying forms, but in essence the difference and similarity between them are roughly as follows:

Let $k: X\times X\to \mathbb{R}$ be a kernel with $k$ and $X$ satisfying some required properties, such as $k$ being positive definite and continuous, and $X$ being compact. Then you can define an integral operator $T_k$ on the $L_2$-functions on $X$: $$ T_k(f)(x) = \int k(x, y)\, f(y)\, d\nu(y) $$ for some appropriate measure $\nu$.
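The integral operator is easy to approximate numerically, which can help build intuition. A small sketch, with the illustrative choices $X = [0, 1]$, $\nu$ the Lebesgue measure, an RBF kernel, and a Riemann-sum discretization (none of these are prescribed by the theorem):

```python
import numpy as np

# Discretize (T_k f)(x) = ∫ k(x, y) f(y) dν(y) on a uniform grid of [0, 1].
n = 500
grid = np.linspace(0.0, 1.0, n)

# Gaussian (RBF) kernel matrix K[i, j] = k(grid[i], grid[j]); an
# illustrative positive-definite, continuous kernel on a compact set.
K = np.exp(-((grid[:, None] - grid[None, :]) ** 2) / 0.1)

f = np.sin(2.0 * np.pi * grid)  # a test function f in L_2([0, 1])

# Riemann-sum approximation: (T_k f)(x_i) ≈ (1/n) Σ_j k(x_i, y_j) f(y_j).
Tf = K @ f / n
```

`Tf` is then a discretized version of $T_k f$, and the eigenvectors of `K / n` approximate the eigenfunctions that Mercer's theorem uses below.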

Mercer's theorem

Mercer's theorem creates a feature map $\Phi_M$ from $X$ to a Hilbert space by first computing the eigenvalues $\lambda_i$ and eigenfunctions $\phi_i$ of $T_k$ and then evaluating the suitably scaled eigenfunctions at the point $x$, $$\Phi_M(x) = \big(\sqrt{\lambda_i}\,\phi_i(x)\big)_{i\ge 1},$$ thus giving a map $\Phi_M: X\to \ell_2$, using the standard inner product on $\ell_2$.
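The empirical analogue of this construction is the eigendecomposition of the Gram matrix: on a finite sample, eigenvectors scaled by the square roots of the eigenvalues play the role of $\sqrt{\lambda_i}\,\phi_i(x)$. A sketch using the quadratic kernel from the question (sample size and data are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
K = (X @ X.T) ** 2              # Gram matrix of the quadratic kernel (PSD)

# K = U diag(lam) U^T; the empirical Mercer features of sample i are
# Phi[i, :] = (sqrt(lam_j) * U[i, j])_j, mirroring Phi_M(x) above.
lam, U = np.linalg.eigh(K)
lam = np.clip(lam, 0.0, None)   # clip tiny negative values from round-off
Phi = U * np.sqrt(lam)          # row i is the feature vector of x_i

# Inner products of the Mercer-style features reproduce the kernel.
assert np.allclose(Phi @ Phi.T, K, atol=1e-8)
```

This is also the idea behind Nyström-type approximations, where one uses only the leading eigenpairs to get low-dimensional approximate feature maps.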

Moore-Aronszajn theorem

The work of Moore and Aronszajn, on the other hand, takes a different approach. For each $x\in X$ they consider the function $f_x: y \mapsto k(x, y)$ and build a Hilbert space of functions on $X$ by completing the span of the $f_x$ under the inner product determined by $\langle f_x, f_y \rangle = k(x, y)$. The feature map $\Phi_{MA}$ is here given as $\Phi_{MA}: X\to \mathcal{H},\; \Phi_{MA}(x) = f_x$.
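This "canonical" feature map can be sketched directly in code: the feature of $x$ is the *function* $f_x = k(x, \cdot)$, which we can only evaluate pointwise, while its inner products are defined through the kernel itself. A minimal sketch, again assuming the quadratic kernel from the question:

```python
import numpy as np

def k(x, y):
    """Quadratic kernel from the question."""
    return np.dot(x, y) ** 2

def feature(x):
    """Moore-Aronszajn feature map: x is sent to the function f_x = k(x, .)."""
    return lambda y: k(x, y)

x, y = np.array([1.0, 2.0]), np.array([3.0, 4.0])
f_x = feature(x)

# The reproducing property: evaluating f_x at y equals <f_x, f_y> = k(x, y).
print(f_x(y))   # 121.0
```

Note that the inner product $\langle f_x, f_y \rangle = k(x, y)$ is never computed from the functions themselves; it is *defined* by the kernel, which is exactly why the kernel trick never needs the feature map explicitly.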

Comparison:

Even though the approaches look different, it can be shown that they are equivalent, in the sense that both map the kernel to the inner product, i.e. both $\Phi_M$ and $\Phi_{MA}$ satisfy $$ k(x,y) = \langle \Phi(x), \Phi(y) \rangle, $$ and that the two Hilbert spaces are isometrically isomorphic.

The above information has been extracted mainly from:

Cucker, Felipe, and Steve Smale. "On the mathematical foundations of learning." Bulletin of the American mathematical society 39.1 (2002): 1-49.
