Scikit-learn – Nystroem Method for Kernel Approximation

Tags: dimensionality reduction, kernel trick, nonlinear, scikit-learn

I have been reading about the Nyström method for low-rank kernel approximation. This method is implemented in scikit-learn [1] as a way to project data samples onto a low-rank approximation of the kernel feature mapping.

To the best of my knowledge, given a training set $\{x_i\}_{i=1}^n$ and a kernel function, it generates a low-rank approximation of the $n \times n$ kernel matrix $K$ by applying SVD to $W$ and $C$.

$$ K = \left[ \begin{array}{cc} W & K_{21}^T \\ K_{21} & K_{22} \end{array} \right], \qquad C = \left[ \begin{array}{c} W \\ K_{21} \end{array} \right], \qquad W \in \mathbb{R}^{l \times l} $$

However, I do not understand how the low-rank approximation of the kernel matrix can be used to project new samples into the approximated kernel feature space. The papers I have found (e.g. [2]) have not been of much help, as they are not very didactic.

I am also curious about the computational complexity of this method, in both the training and test phases.

[1] http://scikit-learn.org/stable/modules/kernel_approximation.html#nystroem-kernel-approx

[2] http://www.jmlr.org/papers/volume13/kumar12a/kumar12a.pdf

Best Answer

Let's derive the Nyström approximation in a way that should make the answers to your questions clearer.

The key assumption in Nyström is that the kernel function is of rank $m$. (Really we assume that it's approximately of rank $m$, but for simplicity let's just pretend it's exactly rank $m$ for now.) That means that any kernel matrix is going to have rank at most $m$, and in particular $$ K = \begin{bmatrix} k(x_1, x_1) & \dots & k(x_1, x_n) \\ \vdots & \ddots & \vdots \\ k(x_n, x_1) & \dots & k(x_n, x_n) \end{bmatrix} ,$$ is rank $m$. Therefore there are $m$ nonzero eigenvalues, and we can write the eigendecomposition of $K$ as $$K = U \Lambda U^T$$ with eigenvectors stored in $U$, of shape $n \times m$, and eigenvalues arranged in $\Lambda$, an $m \times m$ diagonal matrix.
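To make this concrete, here is a minimal NumPy sketch (a toy example of my own, not from the question) where a linear kernel on $m$-dimensional inputs stands in for a generic rank-$m$ kernel, so the rank-$m$ eigendecomposition is visible directly:

```python
# Toy setup: a linear kernel on m-dimensional inputs gives a kernel matrix of
# rank (at most) m, so K = U Λ U^T with U of shape (n, m).
import numpy as np

rng = np.random.default_rng(0)
n, m = 200, 10

X = rng.normal(size=(n, m))          # n points in an m-dimensional feature space
K = X @ X.T                          # linear kernel => rank(K) <= m

eigvals, eigvecs = np.linalg.eigh(K) # eigenvalues in ascending order
print(np.linalg.matrix_rank(K))      # 10: only m nonzero eigenvalues

U = eigvecs[:, -m:]                  # eigenvectors of the m nonzero eigenvalues, (n, m)
Lam = np.diag(eigvals[-m:])          # m x m diagonal matrix of eigenvalues
print(np.allclose(U @ Lam @ U.T, K)) # True: the rank-m eigendecomposition recovers K
```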

So, let's pick $m$ elements, usually uniformly at random but possibly according to other schemes – all that matters in this simplified version is that $K_{11}$ be of full rank. Once we do, just relabel the points so that we end up with the kernel matrix in blocks: $$ K = \begin{bmatrix} K_{11} & K_{21}^T \\ K_{21} & K_{22} \end{bmatrix} ,$$ where we evaluate each entry in $K_{11}$ (which is $m \times m$) and $K_{21}$ ($(n-m) \times m$), but don't want to evaluate any entries in $K_{22}$.

Now, we can split up the eigendecomposition according to this block structure too: \begin{align} K &= U \Lambda U^T \\&= \begin{bmatrix}U_1 \\ U_2\end{bmatrix} \Lambda \begin{bmatrix}U_1 \\ U_2\end{bmatrix}^T \\&= \begin{bmatrix} U_1 \Lambda U_1^T & U_1 \Lambda U_2^T \\ U_2 \Lambda U_1^T & U_2 \Lambda U_2^T \end{bmatrix} ,\end{align} where $U_1$ is $m \times m$ and $U_2$ is $(n-m) \times m$. But note that now we have $K_{11} = U_1 \Lambda U_1^T$. So we can find $U_1$ and $\Lambda$ by eigendecomposing the known matrix $K_{11}$.

We also know that $K_{21} = U_2 \Lambda U_1^T$. Here we know everything in this equation except $U_2$, so we can solve for it: right-multiply both sides by $(\Lambda U_1^T)^{-1} = U_1 \Lambda^{-1}$ to get $$ U_2 = K_{21} U_1 \Lambda^{-1} .$$ Now we have everything we need to evaluate $K_{22}$: \begin{align} K_{22} &= U_2 \Lambda U_2^T \\&= \left(K_{21} U_1 \Lambda^{-1}\right) \Lambda \left(K_{21} U_1 \Lambda^{-1}\right)^T \\&= K_{21} U_1 (\Lambda^{-1} \Lambda) \Lambda^{-1} U_1^T K_{21}^T \\&= K_{21} U_1 \Lambda^{-1} U_1^T K_{21}^T \\&= K_{21} K_{11}^{-1} K_{21}^T \tag{*} \\&= \left( K_{21} K_{11}^{-\frac12} \right) \left( K_{21} K_{11}^{-\frac12} \right)^T \tag{**} .\end{align}

In (*), we've arrived at the Nyström approximation that you might have seen simply given as the definition. It tells us the effective kernel values that we're imputing for the block $K_{22}$.
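To check this numerically, here is a sketch under the same toy rank-$m$ assumption as before (a linear kernel on $m$-dimensional inputs; the landmark choice and variable names are just for illustration):

```python
# Pick m landmark points, eigendecompose K11, recover U2 from K21, and check
# that the Nyström formula (*) reproduces the block K22 we never evaluated.
import numpy as np

rng = np.random.default_rng(0)
n, m = 200, 10
X = rng.normal(size=(n, m))
K = X @ X.T                                    # exactly rank m

idx = rng.permutation(n)                       # "relabel the points"
lm, rest = idx[:m], idx[m:]                    # m landmarks and the n - m others

K11 = K[np.ix_(lm, lm)]                        # m x m, evaluated
K21 = K[np.ix_(rest, lm)]                      # (n - m) x m, evaluated
K22 = K[np.ix_(rest, rest)]                    # the block we pretend not to evaluate

lam, U1 = np.linalg.eigh(K11)                  # K11 = U1 Λ U1^T
U2 = K21 @ U1 @ np.diag(1.0 / lam)             # U2 = K21 U1 Λ^{-1}

K22_hat = K21 @ np.linalg.solve(K11, K21.T)    # (*): K21 K11^{-1} K21^T
print(np.allclose(U2 @ np.diag(lam) @ U2.T, K22_hat))  # same quantity, two ways: True
print(np.allclose(K22_hat, K22))               # True under the exact rank-m assumption
```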

In (**), we see that the feature matrix $K_{21} K_{11}^{-\frac12}$, of shape $(n-m) \times m$, reproduces these imputed kernel values. If we use the rows of $K_{11}^{\frac12}$ as the features for the $m$ landmark points, we get a set of $m$-dimensional features $$ \Phi = \begin{bmatrix} K_{11}^{\frac12} \\ K_{21} K_{11}^{-\frac12} \end{bmatrix} .$$ We can quickly verify that $\Phi$ corresponds to the correct kernel matrix: \begin{align} \Phi \Phi^T &= \begin{bmatrix} K_{11}^{\frac12} \\ K_{21} K_{11}^{-\frac12} \end{bmatrix} \begin{bmatrix} K_{11}^{\frac12} \\ K_{21} K_{11}^{-\frac12} \end{bmatrix}^T \\&=\begin{bmatrix} K_{11}^{\frac12} K_{11}^{\frac12} & K_{11}^{\frac12} K_{11}^{-\frac12} K_{21}^T \\ K_{21} K_{11}^{-\frac12} K_{11}^{\frac12} & K_{21} K_{11}^{-\frac12} K_{11}^{-\frac12} K_{21}^T \end{bmatrix} \\&=\begin{bmatrix} K_{11} & K_{21}^T \\ K_{21} & K_{21} K_{11}^{-1} K_{21}^T \end{bmatrix} \\&= K .\end{align}
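As a hedged sketch on the same toy rank-$m$ kernel (taking the first $m$ points as landmarks, and using an eigendecomposition to form the matrix square roots):

```python
# Build the m-dimensional features Φ from the two evaluated kernel blocks and
# check that Φ Φ^T reproduces the full kernel matrix.
import numpy as np

rng = np.random.default_rng(0)
n, m = 200, 10
X = rng.normal(size=(n, m))
K = X @ X.T

K11, K21 = K[:m, :m], K[m:, :m]                # first m points act as the landmarks

lam, U1 = np.linalg.eigh(K11)
K11_sqrt = U1 @ np.diag(np.sqrt(lam)) @ U1.T           # K11^{1/2}
K11_inv_sqrt = U1 @ np.diag(1.0 / np.sqrt(lam)) @ U1.T # K11^{-1/2}

Phi = np.vstack([K11_sqrt, K21 @ K11_inv_sqrt])        # shape (n, m)
print(np.allclose(Phi @ Phi.T, K))                     # True for an exactly rank-m kernel
```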

So, all we need to do is train our regular learning model with the $m$-dimensional features $\Phi$. This will be exactly the same (under the assumptions we've made) as the kernelized version of the learning problem with $K$.

Now, for an individual data point $x$, the features in $\Phi$ correspond to $$ \phi(x) = \begin{bmatrix} k(x, x_1) & \dots & k(x, x_m) \end{bmatrix} K_{11}^{-\frac12} .$$ For a point $x$ in partition 2, the vector $\begin{bmatrix} k(x, x_1) & \dots & k(x, x_m) \end{bmatrix}$ is just the relevant row of $K_{21}$, so that stacking these up gives us $K_{21} K_{11}^{-\frac12}$ – so $\phi(x)$ agrees for points in partition 2. It also works in partition 1: there, the vector is a row of $K_{11}$, so stacking them up gets $K_{11} K_{11}^{-\frac12} = K_{11}^{\frac12}$, again agreeing with $\Phi$. So...it's still true for an unseen-at-training-time test point $x_\text{new}$. You just do the same thing: $$ \Phi_\text{test} = K_{\text{test},1} K_{11}^{-\frac12} .$$ Because we assumed the kernel is rank $m$, the matrix $\begin{bmatrix}K_{\text{train}} & K_{\text{train,test}} \\ K_{\text{test,train}} & K_{\text{test}} \end{bmatrix}$ is also of rank $m$, and the reconstruction of $K_\text{test}$ is still exact by exactly the same logic as for $K_{22}$.
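The same check works for genuinely new points. In this sketch (same toy linear kernel, first $m$ training points as landmarks), the cross-kernel between test and training points is recovered exactly from the $m$-dimensional features:

```python
# Embed test points that were never seen during training and verify that the
# implied cross-kernel Φ_test Φ_train^T equals the true K_{test,train}.
import numpy as np

rng = np.random.default_rng(0)
n, n_test, m = 200, 50, 10
X = rng.normal(size=(n, m))                    # training points
X_test = rng.normal(size=(n_test, m))          # unseen test points

K11 = X[:m] @ X[:m].T                          # landmark block (first m training points)
lam, U1 = np.linalg.eigh(K11)
K11_inv_sqrt = U1 @ np.diag(1.0 / np.sqrt(lam)) @ U1.T

Phi_train = (X @ X[:m].T) @ K11_inv_sqrt       # K_{train,1} K11^{-1/2}, shape (n, m)
Phi_test = (X_test @ X[:m].T) @ K11_inv_sqrt   # K_{test,1}  K11^{-1/2}, shape (n_test, m)

print(np.allclose(Phi_test @ Phi_train.T, X_test @ X.T))  # True: recovers K_{test,train}
```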


Above, we assumed that the kernel matrix $K$ was exactly rank $m$. This is usually not the case; for a Gaussian kernel, for example, $K$ is always rank $n$, but the trailing eigenvalues typically drop off quickly, so $K$ is close to a matrix of rank $m$, and our reconstructions of $K_{22}$ or $K_{\text{test}}$ will be close to the true values but not exactly the same. They'll be better reconstructions the closer the eigenspace of $K_{11}$ is to that of $K$ overall, which is why choosing the right $m$ points matters in practice.
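To get a feel for this, here is a small sketch (the kernel width and data are illustrative choices of mine) that looks at how fast the spectrum of an RBF kernel matrix decays:

```python
# An RBF kernel matrix is mathematically full rank, but its eigenvalues decay
# quickly, which is what makes a rank-m Nyström approximation reasonable.
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))                  # low-dimensional inputs, smooth kernel
K = rbf_kernel(X, gamma=0.5)

eigvals = np.linalg.eigvalsh(K)[::-1]          # eigenvalues, largest first
print(eigvals[[0, 10, 20, 30, 40, 50]] / eigvals[0])  # the spectrum drops off fast
```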

Note also that if $K_{11}$ has any zero eigenvalues, you can replace inverses with pseudoinverses and everything still works; you just replace $K_{21}$ in the reconstruction with $K_{21} K_{11}^\dagger K_{11}$.
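A quick sketch of that case, where one landmark is a duplicate so $K_{11}$ is singular but the landmarks still span the rank-$m$ feature space:

```python
# With a singular K11 (duplicated landmark), the pseudoinverse version of the
# reconstruction still recovers K22 exactly in this toy rank-m setting.
import numpy as np

rng = np.random.default_rng(0)
n, m = 200, 10
X = rng.normal(size=(n, m))
X[m] = X[0]                                    # make landmark m a copy of landmark 0
K = X @ X.T

K11 = K[:m + 1, :m + 1]                        # m + 1 landmarks, but only rank m
K21 = K[m + 1:, :m + 1]
K22 = K[m + 1:, m + 1:]

print(np.linalg.matrix_rank(K11))              # m, not m + 1: K11 is singular
K22_hat = K21 @ np.linalg.pinv(K11) @ K21.T    # pseudoinverse in place of the inverse
print(np.allclose(K22_hat, K22))               # True: the reconstruction still works
```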

You can use the SVD instead of the eigendecomposition if you'd like; since a kernel matrix is positive semidefinite, they're the same thing, but the SVD can be a little more robust to slight numerical error in the kernel matrix. This is in fact what scikit-learn's implementation does, though it uses $\max(\lambda_i, 10^{-12})$ in the inverse instead of a true pseudoinverse.
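For concreteness, here is a hedged sketch of the SVD-based mapping next to scikit-learn's Nystroem transformer, on the same toy kernel that is exactly rank $m$ (so both recover $K$ exactly; with a real kernel and $m < n$ you would only get an approximation). Nystroem draws its own random landmarks, so its features differ from the manual ones by a rotation, but the implied kernel matrix $\Phi \Phi^T$ is the same:

```python
# The mapping Φ = K_{·,1} K11^{-1/2} written with the SVD of K11, compared with
# sklearn.kernel_approximation.Nystroem on a linear (exactly rank-m) kernel.
import numpy as np
from sklearn.kernel_approximation import Nystroem

rng = np.random.default_rng(0)
n, m = 200, 10
X = rng.normal(size=(n, m))
K = X @ X.T                                    # linear kernel, exactly rank m

# Manual SVD-based mapping
K11, K_all_1 = K[:m, :m], K[:, :m]
U, S, Vt = np.linalg.svd(K11)
K11_inv_sqrt = U @ np.diag(1.0 / np.sqrt(np.maximum(S, 1e-12))) @ Vt
Phi_manual = K_all_1 @ K11_inv_sqrt

# scikit-learn's transformer (its own random choice of m landmarks)
Phi_sk = Nystroem(kernel="linear", n_components=m, random_state=0).fit_transform(X)

print(np.allclose(Phi_manual @ Phi_manual.T, K))   # True
print(np.allclose(Phi_sk @ Phi_sk.T, K))           # True
```

From there the usual pattern is to feed these features to a linear model. As for cost: roughly, fitting the map needs the $n \times m$ block of kernel evaluations plus an $O(m^3)$ SVD of $K_{11}$, and transforming each new point needs $m$ kernel evaluations plus a multiplication by an $m \times m$ matrix.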