Solved – Kernel Ridge Regression Algorithmic Efficiency

kernel trick · regression · ridge regression

Ridge regression can be expressed as $$\hat{y} = \mathbf{x}'(\mathbf{X}'\mathbf{X} + a\mathbf{I}_d)^{-1}\mathbf{X}'\mathbf{y}$$ where $\hat{y}$ is the predicted label, $\mathbf{I}_d$ the $d \times d$ identity matrix, $\mathbf{x}$ the object we're trying to find a label for, $\mathbf{y}$ the $n \times 1$ vector of training labels, and $\mathbf{X}$ the $n \times d$ matrix of $n$ objects $\mathbf{x}_i = (x_{i,1}, \ldots, x_{i,d})\in \mathbb{R}^d$ such that:

$$
\mathbf{X} =
\begin{pmatrix}
x_{1,1} & x_{1,2} & \ldots & x_{1,d}\\
x_{2,1} & x_{2,2} & \ldots & x_{2,d}\\
\vdots & \vdots & \ddots & \vdots\\
x_{n,1} & x_{n,2} &\ldots & x_{n,d}
\end{pmatrix}
$$

We can kernelise this as follows: $$\hat{y} = \mathbf{k}'(\mathbf{\mathcal{K}} + a\mathbf{I}_n)^{-1} \mathbf{y}$$

where $\mathbf{\mathcal{K}}$ is the $n \times n$ matrix of pairwise kernel evaluations $K(\mathbf{x}_i, \mathbf{x}_j)$

$$\mathcal{K} =
\begin{pmatrix}
K(\mathbf{x}_1,\mathbf{x}_1) & K(\mathbf{x}_1,\mathbf{x}_2) & \ldots & K(\mathbf{x}_1,\mathbf{x}_n)\\
K(\mathbf{x}_2,\mathbf{x}_1) & K(\mathbf{x}_2,\mathbf{x}_2) & \ldots & K(\mathbf{x}_2,\mathbf{x}_n)\\
\vdots & \vdots & \ddots & \vdots\\
K(\mathbf{x}_n,\mathbf{x}_1) & K(\mathbf{x}_n,\mathbf{x}_2) &\ldots & K(\mathbf{x}_n,\mathbf{x}_n)
\end{pmatrix} $$

and $\mathbf{k}$ the $n \times 1$ column vector of kernel evaluations between each training object and $\mathbf{x}$

$$\mathbf{k} =
\begin{pmatrix}
K(\mathbf{x}_1,\mathbf{x})\\
K(\mathbf{x}_2,\mathbf{x}) \\
\vdots \\
K(\mathbf{x}_n,\mathbf{x})
\end{pmatrix}$$
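
For concreteness, here is a minimal NumPy sketch of the two formulas above. The function names and the default linear kernel are just illustrative assumptions; any valid kernel function can be passed in.

```python
import numpy as np

def linear_kernel(A, B):
    """K(x, y) = x'y, evaluated for every pair of rows of A and B."""
    return A @ B.T

def krr_fit(X, y, a, kernel=linear_kernel):
    """Solve (K + a I_n) alpha = y once; alpha encodes the fitted model."""
    K = kernel(X, X)
    return np.linalg.solve(K + a * np.eye(len(X)), y)

def krr_predict(X, alpha, x_new, kernel=linear_kernel):
    """y_hat = k'(K + a I_n)^{-1} y = k' alpha, where k_i = K(x_i, x_new)."""
    k = kernel(X, x_new[None, :]).ravel()
    return k @ alpha
```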

Questions:

(a) if there are more objects $\mathbf{x}_i$ than dimensions, does it make sense not to use kernels? E.g., if $\mathbf{X}$ is a $50 \times 3$ matrix, then $\mathbf{X}'\mathbf{X}$ is $3 \times 3$, and we end up inverting a $3 \times 3$ matrix instead of the $50 \times 50$ matrix we would have to invert if we used kernels. Does this mean that if $d \leq n$ we shouldn't use kernels?

(b) should the simplest possible kernel be used? It seems that kernels in ridge regression are used to negate the influence of dimensionality rather than to exploit particular properties of the feature space (unlike support vector machines). That said, kernels can change the distances between objects, so are there any popular kernels often used in ridge regression?

(c) what is the big-$O$ time complexity of ridge regression and/or kernel ridge regression?

Best Answer

(a) The purpose of using a kernel here is to solve a nonlinear regression problem. A good kernel allows you to solve problems in a possibly infinite-dimensional feature space. But using a linear kernel $K(\mathbf{x},\mathbf{y}) = \mathbf{x}^\top \mathbf{y}$ and doing kernel ridge regression in the dual space is the same as solving the problem in the primal space, i.e., it brings no advantage (it is just much slower as the number of samples grows, as you observed).
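
As a quick illustrative check (the toy data and regularisation value are arbitrary), the primal $d \times d$ solve and the dual $n \times n$ solve with a linear kernel give exactly the same prediction:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, a = 50, 3, 0.1
X = rng.normal(size=(n, d))
y = rng.normal(size=n)
x_new = rng.normal(size=d)

# Primal ridge regression: invert a d x d matrix.
w = np.linalg.solve(X.T @ X + a * np.eye(d), X.T @ y)
y_primal = x_new @ w

# Dual (kernel) ridge regression with the linear kernel: invert an n x n matrix.
K = X @ X.T                 # K_ij = x_i' x_j
k = X @ x_new               # k_i  = K(x_i, x_new)
y_dual = k @ np.linalg.solve(K + a * np.eye(n), y)

print(np.isclose(y_primal, y_dual))  # True: same answer, different cost
```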

(b) One of the most popular choices is the squared exponential kernel $K(\mathbf{x},\mathbf{y}) = \exp(-\frac{\tau}{2} \|\mathbf{x}-\mathbf{y}\|^2)$, which is universal (see reference below). There are many, many kernels, and each of them induces a different inner product (and hence metric) on your feature space.
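
As a sketch, the squared exponential kernel can be written so that it drops straight into the kernel ridge regression snippet above (the function name and the default $\tau$ are just placeholders):

```python
import numpy as np

def sq_exp_kernel(A, B, tau=1.0):
    """exp(-tau/2 * ||x - y||^2) for every pair of rows of A and B;
    tau controls how quickly similarity decays with distance."""
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * tau * sq_dists)
```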

(c) A straightforward implementation requires solving an $n \times n$ linear system, so it is $O(n^3)$; ordinary (primal) ridge regression only requires solving a $d \times d$ system, at a cost of $O(nd^2 + d^3)$. There are many faster approximation methods, such as the Nyström approximation. This is an area of active research.
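
For a rough idea of how a Nyström-style approximation cuts the cost, here is a hedged sketch (the function names, the jitter term, and the choice of $m = 100$ landmarks are assumptions for illustration): it builds $m$-dimensional features whose inner products approximate the full kernel matrix, after which ordinary ridge regression on those features costs roughly $O(nm^2 + m^3)$ instead of $O(n^3)$.

```python
import numpy as np

def sq_exp_kernel(A, B, tau=1.0):
    # Same squared-exponential kernel as in (b).
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * tau * d2)

def nystroem_features(X, landmarks, kernel=sq_exp_kernel, jitter=1e-10):
    """Return an n x m feature matrix Phi with Phi @ Phi.T ~= K."""
    W = kernel(landmarks, landmarks) + jitter * np.eye(len(landmarks))
    C = kernel(X, landmarks)                      # n x m cross-kernel block
    vals, vecs = np.linalg.eigh(W)                # W is symmetric PSD
    W_inv_sqrt = (vecs / np.sqrt(np.clip(vals, jitter, None))) @ vecs.T
    return C @ W_inv_sqrt

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 5))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=2000)
landmarks = X[rng.choice(len(X), size=100, replace=False)]   # m = 100 << n

Phi = nystroem_features(X, landmarks)
# Primal ridge regression in the m-dimensional approximate feature space.
w = np.linalg.solve(Phi.T @ Phi + 0.1 * np.eye(Phi.shape[1]), Phi.T @ y)

# Predicting a new point: map it with the same landmarks, then take phi'w.
phi_new = nystroem_features(rng.normal(size=(1, 5)), landmarks)
y_hat = phi_new @ w
```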

References:

  1. Bharath Sriperumbudur, Kenji Fukumizu, and Gert Lanckriet. On the relation between universality, characteristic kernels and RKHS embedding of measures. Journal of Machine Learning Research, 9:773–780, 2010.
  2. Bernhard Schölkopf and Alexander J. Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, 2002.