Understanding Kernel Ridge Regression from an RKHS point of view

machine-learning, reproducing-kernel-hilbert-spaces, statistical-inference

I'd like to verify whether my understanding of kernel ridge regression from a mathematical point of view is correct, in particular why it is often described as being linear in a high-dimensional space.

Step 1: The Representer Theorem

Suppose we are given a dataset $\{(y_{i},x_{i})\}_{i=1}^{n}$ with $y_{i}\in \mathbb{R}$, $x_{i}\in\mathbb{R}^{d}$ and a kernel function $K:\mathbb{R}^{d}\times\mathbb{R}^{d} \rightarrow \mathbb{R}$ inducing the reproducing kernel Hilbert space $\mathcal{H}(K)$.
The Representer Theorem (8.7 Paulsen 2009) tells us that the solution to the minimization problem

$$argmin_{f \in \mathcal{H}(K)} \sum\limits_{i=1}^{n} (y_{i}-f(x_{i}))^{2}+\lambda ||f||_{\mathcal{H}}^{2}$$

is given by a function $\hat{f} \in span \{ k_{x_{1}}, \ldots, k_{x_{n}} \}$. Therefore $\hat{f}$ can be written as $\hat{f}=\sum\limits_{i=1}^{n}\alpha_{i}k_{x_{i}}$ where $k_{x_{i}}$ are the reproducing kernels.
Substituting $f$ yields the expression

$$
\begin{aligned}
argmin_{f \in \mathcal{H}(K)} \sum\limits_{i=1}^{n} (y_{i}-f(x_{i}))^{2}+\lambda \left|\left|f\right|\right|_{\mathcal{H}}^{2}
& =
argmin_{\alpha \in \mathbb{R}^{n}} \sum\limits_{i=1}^{n} \left(y_{i}-\sum\limits_{j=1}^{n}\alpha_{j}k_{x_{j}}(x_{i})\right)^{2}+\lambda \left|\left|\sum\limits_{j=1}^{n}\alpha_{j}k_{x_{j}}\right|\right|_{\mathcal{H}}^{2}\\
& =
argmin_{\alpha \in \mathbb{R}^{n}} \sum\limits_{i=1}^{n} \left(y_{i}-\sum\limits_{j=1}^{n}K(x_{i},x_{j})\alpha_{j}\right)^{2}+\lambda \sum\limits_{i=1}^{n}\sum\limits_{j=1}^{n}\alpha_{i}\alpha_{j}K(x_{i},x_{j})\\
& =
argmin_{\alpha \in \mathbb{R}^{n}} \left(y-K\alpha \right)^{T}
\left(y-K\alpha \right)+\lambda\alpha^{T}K\alpha
\end{aligned}
$$

where $K\in\mathbb{R}^{n\times n}$ is the Gram matrix with entries $K_{ij}=K(x_{i},x_{j})=k_{x_{j}}(x_{i})$, and the penalty term uses the reproducing property: $\left|\left|\sum_{j}\alpha_{j}k_{x_{j}}\right|\right|_{\mathcal{H}}^{2}=\sum_{i,j}\alpha_{i}\alpha_{j}\langle k_{x_{i}},k_{x_{j}}\rangle_{\mathcal{H}}=\alpha^{T}K\alpha$.

Therefore the problem of finding a function in the infinite-dimensional function space $\mathcal{H}(K)$ reduces to a minimization problem in $\mathbb{R}^{n}$, which has the solution $\hat{\alpha}=(K+\lambda I)^{-1}y$.
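To make Step 1 concrete, here is a minimal numerical sketch of my own (not part of the derivation above): it solves the $n$-dimensional problem for $\hat\alpha$ directly and evaluates $\hat f=\sum_i\hat\alpha_i k_{x_i}$ at new points. The Gaussian (RBF) kernel and the toy data are just assumed example choices.

```python
# Minimal sketch of Step 1: fit kernel ridge regression by solving for alpha directly.
# The Gaussian (RBF) kernel below is only an example choice of K.
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    """Gram matrix with entries exp(-gamma * ||A[i] - B[j]||^2)."""
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq_dists)

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(50, 1))             # x_i in R^d (here d = 1)
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=50)  # noisy targets y_i

lam = 0.1
K = rbf_kernel(X, X)                                   # K[i, j] = K(x_i, x_j)
alpha = np.linalg.solve(K + lam * np.eye(len(X)), y)   # alpha_hat = (K + lam*I)^{-1} y

# The fitted function f_hat(x) = sum_i alpha_i k_{x_i}(x) = sum_i alpha_i K(x_i, x)
X_new = np.linspace(-3, 3, 5).reshape(-1, 1)
f_hat = rbf_kernel(X_new, X) @ alpha
print(f_hat)
```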

Step 2: Pull-Back Theorem

Let $\phi:X\rightarrow Y$ be a map (the feature map) and $K:Y\times Y \rightarrow \mathbb{C}$ a kernel function. From the pull-back theorem (5.7 Paulsen 2009) we can deduce that the pull-back $K\circ\phi: X \times X \rightarrow \mathbb{C}$, defined by $(K\circ\phi)(x,y):=K(\phi(x),\phi(y))$, is also a kernel, with RKHS $\mathcal{H}(K \circ \phi)=\{f\circ \phi : f\in \mathcal{H}(K)\}$.
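As a quick sanity check of the pull-back construction (my own illustration, assuming the feature map $\phi(x)=(x,x^{2})$ and the linear kernel $K_{Y}(u,v)=\langle u,v\rangle$), one can verify numerically that the pulled-back kernel still yields a positive semi-definite Gram matrix on any finite set of points:

```python
# Sanity check: for phi(x) = (x, x^2) and the linear kernel K_Y(u, v) = <u, v>,
# the pull-back (K_Y o phi)(x, y) = x*y + x^2*y^2 gives a PSD Gram matrix.
import numpy as np

phi = lambda x: np.array([x, x ** 2])          # assumed feature map X -> Y = R^2
k_pullback = lambda x, y: phi(x) @ phi(y)      # (K_Y o phi)(x, y) = <phi(x), phi(y)>

pts = np.linspace(-2.0, 2.0, 7)
G = np.array([[k_pullback(x, y) for y in pts] for x in pts])
print(np.linalg.eigvalsh(G).min() >= -1e-10)   # True: smallest eigenvalue is (numerically) nonnegative
```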

Step 3: Kernel Ridge Regression

Let $\phi: X \rightarrow Y$ be the feature map and $K_{Y}: Y \times Y \rightarrow \mathbb{C}, \, K_{Y}(u,v)= \langle u, v \rangle$. Then $\mathcal{H}(K_{Y})$ is the RKHS of bounded linear functionals over $Y$. Using the pull-back theorem, $K_{X}(x,y):= (K_{Y} \circ \phi)(x,y)=\langle \phi(x) ,\phi(y)\rangle$ is a kernel function on $X$. Given a dataset $\{(y_{i},x_{i})\}_{i=1}^{n}$ we want to solve the minimization problem

$$argmin_{f \in \mathcal{H}(K_{X})} \sum\limits_{i=1}^{n} (y_{i}-f(x_{i}))^{2}+\lambda ||f||_{\mathcal{H}(K_{X})}^{2}$$

The solution to this problem was derived in Step 1 and is given by $\hat{\alpha}=(K+\lambda I)^{-1}y$ with
$$K=
\left(
\begin{array}{cccc}
K_{X}(x_{1},x_{1}) & K_{X}(x_{1},x_{2}) & \ldots & K_{X}(x_{1},x_{n})\\
\vdots & \vdots & \ddots & \vdots \\
K_{X}(x_{n},x_{1}) & K_{X}(x_{n},x_{2}) & \ldots & K_{X}(x_{n},x_{n})\\
\end{array}
\right)
=
\left(
\begin{array}{cccc}
\langle \phi(x_{1}),\phi(x_{1}) \rangle & \langle \phi(x_{1}),\phi(x_{2}) \rangle & \ldots & \langle\phi(x_{1}),\phi(x_{n})\rangle\\
\vdots & \vdots & \ddots & \vdots \\
\langle \phi(x_{n}) , \phi(x_{1}) \rangle & \langle \phi(x_{n}) , \phi(x_{2}) \rangle & \ldots & \langle \phi(x_{n}),\phi(x_{n}) \rangle \\
\end{array}
\right)
$$
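The following sketch (again my own example, assuming the homogeneous quadratic kernel $K_{X}(x,y)=\langle x,y\rangle^{2}$ with explicit feature map $\phi(x)=(x_{1}^{2},\sqrt{2}\,x_{1}x_{2},x_{2}^{2})$) checks that the Gram matrix built from $K_{X}$ agrees with the matrix of inner products $\langle\phi(x_{i}),\phi(x_{j})\rangle$, so $\phi$ never has to be evaluated explicitly:

```python
# Check that K_X(x_i, x_j) = <phi(x_i), phi(x_j)> for a quadratic kernel and its feature map.
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(6, 2))                        # data points x_i in R^2

def phi(x):
    """Explicit feature map into Y = R^3 for the homogeneous quadratic kernel."""
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

K_via_features = np.array([[phi(a) @ phi(b) for b in X] for a in X])
K_via_kernel = (X @ X.T) ** 2                      # K_X(x_i, x_j) = <x_i, x_j>^2

print(np.allclose(K_via_features, K_via_kernel))   # True
```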

Conclusion

$\mathcal{H}(K_{Y})$ is the RKHS of bounded linear functionals, and due to the pull-back theorem every function $f\in \mathcal{H}(K_{X})$ is the composition of a function in $\mathcal{H}(K_{Y})$ (linear on $Y$) with the feature map $\phi$, which gives rise to the above-mentioned interpretation.
Is this correct?

Best Answer

Most of what you've written looks good to me; there is just a terminology issue: you say that "$\mathcal H$ is a RKHS of bounded linear functionals", but that is not correct. Indeed, $\mathcal H$ is a Hilbert space of functions, not functionals, and they need not be linear nor bounded (consider any non-linear kernel: it induces a non-linear function in $\mathcal H(K)$, and you can scale it by an arbitrarily large constant).

What is true, however, is that for every point $x$ in the space $X$ over which the functions of $\mathcal H$ are defined, the evaluation functional $T_x:\mathcal H\ni f \mapsto f(x)$ is bounded and linear. That is simply the definition of an RKHS.
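To illustrate this numerically (a small example of my own, using a Gaussian kernel on $\mathbb R$): a function $f=\sum_i\alpha_i k_{x_i}\in\mathcal H(K)$ is generally not linear in its argument, even though each evaluation functional $T_x$ is linear in $f$.

```python
# With a non-linear (Gaussian) kernel, f = sum_i alpha_i k_{x_i} is not linear in x.
import numpy as np

k = lambda x, y: np.exp(-(x - y) ** 2)        # Gaussian kernel on R
centers = np.array([-1.0, 0.5, 2.0])          # points x_i
alpha = np.array([0.3, -1.2, 0.7])            # coefficients alpha_i

f = lambda x: np.sum(alpha * k(centers, x))   # f(x) = sum_i alpha_i K(x_i, x)

print(f(2 * 1.3), 2 * f(1.3))                 # the two values differ: f is not linear in x

# By contrast, T_x(a*f + b*g) = a*f(x) + b*g(x) holds for any f, g and any x,
# i.e. evaluation at a point is always a linear functional on H(K).
```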
