Solved – Kernel approximation with Nystroem method and usage in scikit-learn

Tags: approximation, gaussian process, kernel trick, rbf kernel, regression

I am planning to use the Nystroem method to approximate a Gram matrix induced by any kernel function. I found the Nystroem implementation in scikit-learn.

As far as I understand, the method should estimate the full Gram matrix. Let $x_1, \ldots, x_n$ be data points with $x_i \in \mathbb{R}^d$ for all $i = 1, \ldots, n$. My goal is to build a kernel matrix containing $k(x_i, x_j)$ for every pair, so the output should be $\tilde{G} \in \mathbb{R}^{n \times n}$. The scikit-learn implementation, however, returns a matrix in $\mathbb{R}^{n \times m}$, where $m$ is the number of components the user has chosen for the low-rank approximation.

How can the full Gram/Kernel Matrix be approximated using scikit-learn?

# imports
import numpy as np
from sklearn.kernel_approximation import Nystroem
from sklearn.gaussian_process.kernels import RBF
    
# creating data
x = np.random.normal(size=(100, 2))

# accurate kernel function
kernel = RBF()
gram_matrix = kernel(x)
    
# approximated kernel function
m = 50
kernel_approx = Nystroem(kernel, n_components=m)
gram_matrix_approx = kernel_approx.fit_transform(x)
    
if gram_matrix.shape == gram_matrix_approx.shape:
    print('True')
else:
    print('False')

The shapes are always different. Why?

Best Answer

Sklearn's Nystroem does not compute the Gram matrix itself; it returns the feature map $\Phi$. The exact kernel matrix is then approximated by $\tilde{G} = \Phi \Phi^\top$. Your code should look like this:

kernel_approx = Nystroem(kernel, n_components=m)
feature_matrix = kernel_approx.fit_transform(x)
gram_matrix_approx = feature_matrix @ feature_matrix.T

And then

if gram_matrix.shape == gram_matrix_approx.shape:
    print('True')
else:
    print('False')

would print what you expect.

You can check how good the approximation is by comparing the two matrices visually:

import matplotlib.pyplot as plt
plt.matshow(gram_matrix)
plt.matshow(gram_matrix_approx)
plt.show()
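A numerical check can complement the visual one. The following is a minimal sketch, not part of the original answer: it assumes the built-in "rbf" kernel with gamma = 0.5 (which corresponds to RBF() with length scale 1) and reports the relative Frobenius error of $\Phi \Phi^\top$ for a few values of $m$. The helper name relative_error is hypothetical.

```python
import numpy as np
from sklearn.kernel_approximation import Nystroem
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(0)
x = rng.normal(size=(100, 2))

# Assumption: gamma = 0.5 matches RBF() with length_scale = 1.
gamma = 0.5
gram_matrix = rbf_kernel(x, gamma=gamma)


def relative_error(m):
    """Relative Frobenius error of the Nystroem reconstruction with m components."""
    features = Nystroem(
        kernel="rbf", gamma=gamma, n_components=m, random_state=0
    ).fit_transform(x)
    approx = features @ features.T
    return np.linalg.norm(gram_matrix - approx) / np.linalg.norm(gram_matrix)


errors = {m: relative_error(m) for m in (5, 20, 50, 100)}
for m, err in errors.items():
    print(f"m={m:3d}  relative error={err:.2e}")
```

With $m = n$ every point is a landmark, so the reconstruction should be essentially exact; smaller $m$ trades accuracy for a smaller feature matrix.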

If you do want the full $n \times n$ Gram matrix, computing it directly takes $\mathcal{O}(n^2)$ kernel evaluations, whereas forming it from the Nystroem features costs $\mathcal{O}(n^2 m)$ for the product $\Phi \Phi^\top$ alone, so computing it directly would be faster. The approximation pays off only when you work with the $n \times m$ features $\Phi$ and never form the $n \times n$ matrix.

Related Solutions

Solved – How to apply a Gaussian radial basis function kernel PCA to nonlinear data

The first problem seems to be that the sign of gamma is wrong (it should be negative: $-15$, as in the definition of the kernel, not as in your code). Alternatively, use exp(-gamma * mat_sq_dists).

The second problem is that sorting with zip clobbers the eigenvectors: zipping eigvals with eigvecs pairs each eigenvalue with a row of eigvecs, but the $i$-th eigenvector is the column eigvecs[:,i], not the row eigvecs[i,:] (see the documentation of scipy.linalg.eigh). You should also prefer eigh to eig because you have a real symmetric matrix.

Replace

< gamma = 15
> gamma = -15

and (to get ordered, real eigenvalues)

< eigvals, eigvecs = np.linalg.eig(K)
> eigvals, eigvecs = scipy.linalg.eigh(K)

and

< eigvals, eigvecs = zip(*sorted(zip(eigvals, eigvecs), reverse=True))
< X_pc1 = eigvecs[0]
> X_pc1 = eigvecs[:,99]
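Put together, the three fixes might look like the sketch below. The asker's full script is not shown here, so the data is a random placeholder and the kernel-centering step is an assumption (it is the usual step in kernel PCA); only the names gamma, mat_sq_dists, K, eigvals, eigvecs and X_pc1 come from the snippets above.

```python
import numpy as np
from scipy.linalg import eigh
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))  # placeholder data, not the asker's dataset

gamma = 15
mat_sq_dists = squareform(pdist(X, "sqeuclidean"))
K = np.exp(-gamma * mat_sq_dists)  # note the minus sign in the exponent

# Assumed step: center the kernel matrix in feature space.
n = K.shape[0]
one_n = np.ones((n, n)) / n
K = K - one_n @ K - K @ one_n + one_n @ K @ one_n

# eigh returns real eigenvalues in ascending order; eigenvectors are columns.
eigvals, eigvecs = eigh(K)
X_pc1 = eigvecs[:, -1]  # eigenvector of the largest eigenvalue
X_pc2 = eigvecs[:, -2]  # eigenvector of the second largest eigenvalue
```

Indexing with -1 instead of 99 selects the same column when there are 100 samples and keeps working for other sample sizes.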

Finally, you can examine scikit-learn's own implementation (sklearn.decomposition.KernelPCA).

Scikit-learn – Nystroem Method for Kernel Approximation

Let's derive the Nyström approximation in a way that should make the answers to your questions clearer.

The key assumption in Nyström is that the kernel function is of rank $m$. (Really we assume that it's approximately of rank $m$, but for simplicity let's just pretend it's exactly rank $m$ for now.) That means that any kernel matrix is going to have rank at most $m$, and in particular $$ K = \begin{bmatrix} k(x_1, x_1) & \dots & k(x_1, x_n) \\ \vdots & \ddots & \vdots \\ k(x_n, x_1) & \dots & k(x_n, x_n) \end{bmatrix} ,$$ is rank $m$. Therefore there are $m$ nonzero eigenvalues, and we can write the eigendecomposition of $K$ as $$K = U \Lambda U^T$$ with eigenvectors stored in $U$, of shape $n \times m$, and eigenvalues arranged in $\Lambda$, an $m \times m$ diagonal matrix.

So, let's pick $m$ elements, usually uniformly at random but possibly according to other schemes – all that matters in this simplified version is that $K_{11}$ be of full rank. Once we do, just relabel the points so that we end up with the kernel matrix in blocks: $$ K = \begin{bmatrix} K_{11} & K_{21}^T \\ K_{21} & K_{22} \end{bmatrix} ,$$ where we evaluate each entry in $K_{11}$ (which is $m \times m$) and $K_{21}$ ($(n-m) \times m$), but don't want to evaluate any entries in $K_{22}$.

Now, we can split up the eigendecomposition according to this block structure too: \begin{align} K &= U \Lambda U^T \\&= \begin{bmatrix}U_1 \\ U_2\end{bmatrix} \Lambda \begin{bmatrix}U_1 \\ U_2\end{bmatrix}^T \\&= \begin{bmatrix} U_1 \Lambda U_1^T & U_1 \Lambda U_2^T \\ U_2 \Lambda U_1^T & U_2 \Lambda U_2^T \end{bmatrix} ,\end{align} where $U_1$ is $m \times m$ and $U_2$ is $(n-m) \times m$. But note that now we have $K_{11} = U_1 \Lambda U_1^T$. So we can find $U_1$ and $\Lambda$ by eigendecomposing the known matrix $K_{11}$.

We also know that $K_{21} = U_2 \Lambda U_1^T$. Here, we know everything in this equation except $U_2$, so we can solve for the eigenvectors that implies: right-multiply both sides by $(\Lambda U_1^T)^{-1} = U_1 \Lambda^{-1}$ to get $$ U_2 = K_{21} U_1 \Lambda^{-1} .$$ Now we have everything we need to evaluate $K_{22}$: \begin{align} K_{22} &= U_2 \Lambda U_2^T \\&= \left(K_{21} U_1 \Lambda^{-1}\right) \Lambda \left(K_{21} U_1 \Lambda^{-1}\right)^T \\&= K_{21} U_1 (\Lambda^{-1} \Lambda) \Lambda^{-1} U_1^T K_{21}^T \\&= K_{21} U_1 \Lambda^{-1} U_1^T K_{21}^T \\&= K_{21} K_{11}^{-1} K_{21}^T \tag{*} \\&= \left( K_{21} K_{11}^{-\frac12} \right) \left( K_{21} K_{11}^{-\frac12} \right)^T \tag{**} .\end{align}
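The identity (*) can be checked numerically on a kernel that really is of rank $m$. The sketch below is an illustration under an assumption of my own choosing: a linear kernel on points in $\mathbb{R}^m$, so that $K = X X^T$ has rank $m$ by construction. The variable names mirror the symbols in the derivation.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 8, 3
X = rng.normal(size=(n, m))  # assumed setup: linear kernel on R^m has rank <= m
K = X @ X.T

K11 = K[:m, :m]  # m x m block we evaluate
K21 = K[m:, :m]  # (n - m) x m block we evaluate
K22 = K[m:, m:]  # block we want to recover without evaluating it

lam, U1 = np.linalg.eigh(K11)     # K11 = U1 diag(lam) U1^T
U2 = K21 @ U1 / lam               # U2 = K21 U1 Lambda^{-1}
K22_from_eig = (U2 * lam) @ U2.T  # U2 Lambda U2^T
K22_from_inv = K21 @ np.linalg.solve(K11, K21.T)  # K21 K11^{-1} K21^T

print(np.allclose(K22_from_eig, K22), np.allclose(K22_from_inv, K22))
```

Both routes should reproduce the true $K_{22}$ up to floating-point error, because the rank assumption holds exactly in this toy setup.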

In (*), we've found a version of the Nyström embedding you might have seen simply as the definition. This tells us the effective kernel values that we're imputing for the block $K_{22}$.

In (**), we see that the feature matrix $K_{21} K_{11}^{-\frac12}$, which is shape $(n-m) \times m$, corresponds to these imputed kernel values. If we use $K_{11}^{\frac12}$ for the $m$ points, we have a set of $m$-dimensional features $$ \Phi = \begin{bmatrix} K_{11}^{\frac12} \\ K_{21} K_{11}^{-\frac12} \end{bmatrix} .$$ We can just quickly verify that $\Phi$ corresponds to the correct kernel matrix: \begin{align} \Phi \Phi^T &= \begin{bmatrix} K_{11}^{\frac12} \\ K_{21} K_{11}^{-\frac12} \end{bmatrix} \begin{bmatrix} K_{11}^{\frac12} \\ K_{21} K_{11}^{-\frac12} \end{bmatrix}^T \\&=\begin{bmatrix} K_{11}^{\frac12} K_{11}^{\frac12} & K_{11}^{\frac12} K_{11}^{-\frac12} K_{21}^T \\ K_{21} K_{11}^{-\frac12} K_{11}^{\frac12} & K_{21} K_{11}^{-\frac12} K_{11}^{-\frac12} K_{21}^T \end{bmatrix} \\&=\begin{bmatrix} K_{11} & K_{21}^T \\ K_{21} & K_{21} K_{11}^{-1} K_{21}^T \end{bmatrix} \\&= K .\end{align}
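As a small sanity check of this feature construction, the sketch below builds $\Phi$ from $K_{11}$ and $K_{21}$ only and compares $\Phi \Phi^T$ with the full $K$. It again assumes an exactly rank-$m$ linear kernel, which is a choice made for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 8, 3
X = rng.normal(size=(n, m))  # assumed setup: exactly rank-m linear kernel
K = X @ X.T
K11, K21 = K[:m, :m], K[m:, :m]

# Matrix square root and inverse square root via the eigendecomposition of K11.
lam, U1 = np.linalg.eigh(K11)
K11_sqrt = (U1 * np.sqrt(lam)) @ U1.T
K11_inv_sqrt = (U1 / np.sqrt(lam)) @ U1.T

# Stack the features of the m landmark points and of the remaining points.
Phi = np.vstack([K11_sqrt, K21 @ K11_inv_sqrt])

print(Phi.shape, np.allclose(Phi @ Phi.T, K))
```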

So, all we need to do is train our regular learning model with the $m$-dimensional features $\Phi$. This will be exactly the same (under the assumptions we've made) as the kernelized version of the learning problem with $K$.
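To illustrate the equivalence, the sketch below fits a ridge regression on the Nystroem features and a kernel ridge regression on the implied Gram matrix $\Phi \Phi^T$, then compares the training predictions. The data, gamma, alpha and the number of components are assumptions for the example; the intercept is switched off so that both models solve the same problem.

```python
import numpy as np
from sklearn.kernel_approximation import Nystroem
from sklearn.kernel_ridge import KernelRidge
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
x = rng.normal(size=(200, 2))
y = np.sin(x[:, 0]) + 0.1 * rng.normal(size=200)

# m-dimensional Nystroem features (m = 30 is an arbitrary choice).
phi = Nystroem(
    kernel="rbf", gamma=0.5, n_components=30, random_state=0
).fit_transform(x)

alpha = 1.0

# Linear model on the features: works with a 30 x 30 system.
linear_model = Ridge(alpha=alpha, fit_intercept=False).fit(phi, y)
pred_features = linear_model.predict(phi)

# Kernelized model on the implied Gram matrix: works with a 200 x 200 system.
gram_approx = phi @ phi.T
kernel_model = KernelRidge(alpha=alpha, kernel="precomputed").fit(gram_approx, y)
pred_kernel = kernel_model.predict(gram_approx)

max_gap = np.abs(pred_features - pred_kernel).max()
print(f"largest difference between the two predictions: {max_gap:.2e}")
```

The two sets of predictions should agree up to numerical precision, while the feature-based model never forms the $n \times n$ matrix.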

Now, for an individual data point $x$, the features in $\Phi$ correspond to $$ \phi(x) = \begin{bmatrix} k(x, x_1) & \dots & k(x, x_m) \end{bmatrix} K_{11}^{-\frac12} .$$ For a point $x$ in partition 2, the vector $\begin{bmatrix} k(x, x_1) & \dots & k(x, x_m) \end{bmatrix}$ is just the relevant row of $K_{21}$, so that stacking these up gives us $K_{21} K_{11}^{-\frac12}$ – so $\phi(x)$ agrees for points in partition 2. It also works in partition 1: there, the vector is a row of $K_{11}$, so stacking them up gets $K_{11} K_{11}^{-\frac12} = K_{11}^{\frac12}$, again agreeing with $\Phi$. So...it's still true for an unseen-at-training-time test point $x_\text{new}$. You just do the same thing: $$ \Phi_\text{test} = K_{\text{test},1} K_{11}^{-\frac12} .$$ Because we assumed the kernel is rank $m$, the matrix $\begin{bmatrix}K_{\text{train}} & K_{\text{train,test}} \\ K_{\text{test,train}} & K_{\text{test}} \end{bmatrix}$ is also of rank $m$, and the reconstruction of $K_\text{test}$ is still exact by exactly the same logic as for $K_{22}$.
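In scikit-learn, this out-of-sample step is what transform does on a fitted Nystroem object. The sketch below, with data sizes and gamma chosen arbitrarily, maps unseen points to features and compares the implied test/train kernel block with the exact one. Since a Gaussian kernel is only approximately low rank, the match is approximate rather than exact.

```python
import numpy as np
from sklearn.kernel_approximation import Nystroem
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(0)
x_train = rng.normal(size=(200, 2))
x_test = rng.normal(size=(20, 2))  # points not seen during fit
gamma = 0.5

nys = Nystroem(kernel="rbf", gamma=gamma, n_components=60, random_state=0)
phi_train = nys.fit_transform(x_train)
phi_test = nys.transform(x_test)  # K_{test,1} K_11^{-1/2}

# Implied kernel block between test and training points vs. the exact block.
K_test_train_approx = phi_test @ phi_train.T
K_test_train_exact = rbf_kernel(x_test, x_train, gamma=gamma)

mean_abs_err = np.abs(K_test_train_approx - K_test_train_exact).mean()
print(phi_test.shape, f"mean absolute error: {mean_abs_err:.2e}")
```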

Above, we assumed that the kernel matrix $K$ was exactly rank $m$. This is not usually going to be the case; for a Gaussian kernel, for example, $K$ is always rank $n$ (for distinct points), but the trailing eigenvalues typically drop off pretty quickly, so it's going to be close to a matrix of rank $m$, and our reconstructions of $K_{22}$ or $K_{\text{test}}$ are going to be close to the true values but not exactly the same. They'll be better reconstructions the closer the eigenspace of $K_{11}$ gets to that of $K$ overall, which is why choosing the right $m$ points is important in practice.

Note also that if $K_{11}$ has any zero eigenvalues, you can replace inverses with pseudoinverses and everything still works; you just replace $K_{21}$ in the reconstruction with $K_{21} K_{11}^\dagger K_{11}$.
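A quick way to see the pseudoinverse version at work is to make $K_{11}$ singular on purpose. The sketch below assumes a linear kernel on points in $\mathbb{R}^2$ with $m = 3$ landmarks, so $K_{11}$ is $3 \times 3$ but only of rank 2. The explicit rcond cutoff is a choice for this example, so that the numerically zero eigenvalue is treated as zero.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, d = 8, 3, 2
X = rng.normal(size=(n, d))  # assumed setup: kernel of rank 2, three landmarks
K = X @ X.T

K11, K21, K22 = K[:m, :m], K[m:, :m], K[m:, m:]
print("rank of K11:", np.linalg.matrix_rank(K11))

# Pseudoinverse with an explicit cutoff for the (numerically) zero eigenvalue.
K11_pinv = np.linalg.pinv(K11, rcond=1e-10)

K21_rec = K21 @ K11_pinv @ K11    # what K21 becomes in the reconstruction
K22_rec = K21 @ K11_pinv @ K21.T  # imputed K22

print(np.allclose(K21_rec, K21), np.allclose(K22_rec, K22))
```

Because the whole kernel has rank 2 here, $K_{21} K_{11}^\dagger K_{11}$ coincides with $K_{21}$ and the imputed $K_{22}$ is still exact.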

You can use the SVD instead of the eigendecomposition if you'd like; since $K$ is psd, they're the same thing, but the SVD might be a little more robust to slight numerical error in the kernel matrix and such. That is what scikit-learn's actual implementation does, though it uses $\max(\lambda_i, 10^{-12})$ in the inverse instead of the pseudoinverse.
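The recipe can be reproduced by hand and compared against the library. This is a sketch under assumptions: it relies on the fitted attribute components_ (the sampled landmark points) of Nystroem, and it uses a small $m$ and a fairly large gamma so that $K_{11}$ is well conditioned and the comparison does not depend on how near-zero singular values are handled.

```python
import numpy as np
from scipy.linalg import svd
from sklearn.kernel_approximation import Nystroem
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(0)
x = rng.normal(size=(100, 2))
gamma, m = 2.0, 10  # assumed values, chosen to keep K11 well conditioned

nys = Nystroem(kernel="rbf", gamma=gamma, n_components=m, random_state=0).fit(x)
landmarks = nys.components_  # the m sampled training points

# K11^{-1/2} via the SVD, with singular values clipped at 1e-12.
K11 = rbf_kernel(landmarks, gamma=gamma)
U, S, Vt = svd(K11)
S = np.maximum(S, 1e-12)
K11_inv_sqrt = (U / np.sqrt(S)) @ Vt

# Features: kernel values against the landmarks, times K11^{-1/2}.
phi_manual = rbf_kernel(x, landmarks, gamma=gamma) @ K11_inv_sqrt.T
phi_sklearn = nys.transform(x)

print(np.allclose(phi_manual, phi_sklearn, atol=1e-6))
```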
