Solved – Kernel approximation with Nystroem method and usage in scikit-learn

approximationgaussian processkernel trickrbf kernelregression

I am planning to use the Nystroem method to approximate a Gram matrix induced by any kernel function. I found the Nystroem implementation in scikit-learn.

As far as I understood, the full Gram Matrix should be estimated. Let have $x_1, \ldots, x_n$ as data points where $x_i \in \mathbb{R}^d$ for all $i = 1 \ldots n$. My goal is to build a Kernel Matrix containing each pair $k(x_i, x_j)$. Then the output should be $\tilde{G} \in \mathbb{R}^{n \times n}.$ The scikit-learn implementation, however, returns $\tilde{G} \in \mathbb{R}^{n \times m}$ where $m$ is the number of components the user has given for the lower-rank approximation.

How can the full Gram/Kernel Matrix be approximated using scikit-learn?

# imports
from sklearn.kernel_approximation import Nystroem
from sklearn.gaussian_process.kernels import RBF
    
# creating data
x = np.random.normal(size=(100, 2))

# accurate kernel function
kernel = RBF()
gram_matrix = kernel(x)
    
# approximated kernel function
m = 50
kernel_approx = Nystroem(kernel, n_components=m)
gram_matrix_approx = kernel_approx.fit_transform(x)
    
if gram_matrix.shape == gram_matrix_approx.shape:
    print('True')
else:
    print('False')

The shapes are always different. Why?

Best Answer

Sklearn's Nystroem does not compute the Gram matrix itself, it returns the Feature map $\Phi$. The exact kernel matrix is approximated by $\tilde{G} = \Phi \Phi^\top$. Your code should look like this:

kernel_approx = Nystroem(kernel, n_components=m)
feature_matrix = kernel_approx.fit_transform(x)
gram_matrix_approx = feature_matrix @ feature_matrix.T

And then

if gram_matrix.shape == gram_matrix_approx.shape:
    print('True')
else:
    print('False')

would print what you expect.

You can check how good an approximation you got by visually comparing

import matplotlib.pyplot as plt
plt.matshow(gram_matrix)
plt.matshow(gram_matrix_approx)
plt.show()

If you do want the $n \times n$ Gram matrix, it will take $\mathcal{O}(n^2)$ time if you compute it directly, and $\mathcal{O}(n^2 m)$ time if you first compute the Nystroem approximation and then evaluate $\Phi \Phi^\top$, so computing it directly would be faster.

Related Question