Relationship between the eigenspectrum of the Gram matrix and the kernel operator

Tags: eigenvalues-eigenvectors, functional-analysis, machine-learning, statistics

A kernel is a function $k: \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$, for instance $k(x, x') = \exp(-|x-x'|/2)$ or $k(x, x') = x \cdot x'$. There are many kernels common in statistics and machine learning, and we know they compute some form of "similarity" between $x$ and $x'$. Then, given a dataset $X$ of, say, $n$ data points in $\mathbb{R}^d$, so that $X \in \mathbb{R}^{d \times n}$, we can define a Gram matrix by $K(X, X)_{ij} = k(x_i, x_j)$. We know (by Mercer's theorem) that continuous positive semidefinite kernels, in particular many stationary kernels, can be eigendecomposed with respect to an input distribution into a spectrum of eigenvalues $\{\lambda_i\}$ and orthonormal eigenfunctions $\{\phi_i\}$. My question is: what is known about the relationship between the eigenvalues of the kernel operator associated with $k$ (it may have an infinite spectrum) and the eigenvalues of the Gram matrix $K(X, X)$, of which there are only finitely many for any finite dataset $X$? In particular, I am wondering how this relationship depends on the distribution of the dataset $X$, but any information is appreciated.
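For concreteness, here is a minimal sketch (in numpy, using the exponential kernel from the examples above; the data and variable names are purely illustrative) of the Gram matrix and its finite spectrum:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 3
X = rng.normal(size=(n, d))              # n data points in R^d (stored as n x d here for convenience)

# Exponential kernel k(x, x') = exp(-||x - x'|| / 2), one of the examples above
dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
K = np.exp(-0.5 * dists)                 # Gram matrix, K_{ij} = k(x_i, x_j)

# K is symmetric PSD, so eigvalsh applies; its spectrum has exactly n eigenvalues
eigvals = np.sort(np.linalg.eigvalsh(K))[::-1]
print(eigvals[:10])                      # the leading eigenvalues dominate, the rest decay quickly
```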

I am asking with an eye towards using information from the spectrum of the kernel operator to invert the Gram matrix $K(X,X)$ more cleanly or quickly, so if anything is known about that, feel free to add it too. I know, of course, that in the infinite-data limit $n \to \infty$ the (suitably scaled) spectra agree, but I am wondering what is known about the pre-asymptotic behavior or the rate of convergence in this limit.
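A quick numerical illustration of that convergence, as a sketch under stated assumptions: for the squared-exponential kernel with a standard normal input density, the Mercer eigenvalues are available in closed form (see e.g. Rasmussen & Williams, Gaussian Processes for Machine Learning, Sec. 4.3.1), so one can compare them directly against the eigenvalues of $K/n$:

```python
import numpy as np

# Closed-form Mercer eigenvalues of k(x, x') = exp(-(x - x')^2 / (2 ell^2))
# with respect to N(0, sigma2) (Rasmussen & Williams, GPML, Sec. 4.3.1)
def se_mercer_eigenvalues(num, ell=1.0, sigma2=1.0):
    a = 1.0 / (4.0 * sigma2)
    b = 1.0 / (2.0 * ell ** 2)
    c = np.sqrt(a ** 2 + 2.0 * a * b)
    A = a + b + c
    return np.sqrt(2.0 * a / A) * (b / A) ** np.arange(num)

rng = np.random.default_rng(0)
for n in (100, 400, 1600):
    x = rng.normal(size=n)                                    # inputs drawn from N(0, 1)
    K = np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2)         # Gram matrix
    empirical = np.sort(np.linalg.eigvalsh(K))[::-1][:5] / n  # leading eigenvalues of K / n
    print(n, np.round(empirical, 4), np.round(se_mercer_eigenvalues(5), 4))
```

As $n$ grows, the leading eigenvalues of $K/n$ line up with the closed-form operator eigenvalues, which is the scaled agreement referred to above.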

Best Answer

Any PSD kernel $K(\cdot,\cdot)$ induces an associated RKHS $\mathscr{H}$, so we can use the reproducing property to derive the relationship between the spectrum of the kernel matrix $K$ and the spectrum of the kernel operator $\mathscr{K}$.

Let $(\Lambda_i,\psi_i)_{i=1}^{\infty}$ be the eigenvalue/eigenfunction pairs of the (empirical) kernel operator $$ \begin{align} \mathscr{K} &= \frac{1}{n}\sum_{i=1}^n K(\cdot,x_i) \otimes K(\cdot,x_i), \\ \mathscr{K}(f) &= \frac{1}{n}\sum_{i=1}^n \langle K(\cdot,x_i),\, f\rangle_{\mathscr{H}}\, K(\cdot,x_i). \end{align} $$ Furthermore, let $K$ be the $n\times n$ kernel matrix defined by $$ \begin{align} K_{ij} = \langle K(\cdot,x_i),\, K(\cdot,x_j)\rangle_\mathscr{H} = K(x_i, x_j), \end{align}$$ and let $(\lambda_i,e_i)_{i=1}^n$ be its eigenvalue/eigenvector pairs. Note that each $\psi_i\in\mathscr{H}$ lives in a possibly infinite-dimensional space, whereas $e_i\in\mathbb{R}^n$. Let's establish the connection between the two.

Being an eigenfunction means that $$ \begin{align} \Lambda_i \psi_i =\mathscr{K}( \psi_i ) &= \frac{1}{n}\sum_{j=1}^n \langle K(\cdot,x_j) ,\, \psi_i \rangle_\mathscr{H} \, K(\cdot,x_j)\\ &=\frac{1}{n}\sum_{j=1}^n \psi_i (x_j) K(\cdot,x_j) \tag{Reproducing property} \\ \implies n\Lambda_i \psi_i &= \sum_{j=1}^n \psi_i (x_j) K(\cdot,x_j), \end{align} $$ so the coefficients of the function $n\Lambda_i \psi_i$ in the subspace of $\mathscr{H}$ spanned by $\left\{K(\cdot,x_j)\right\}_{j=1}^n$ are given by the point evaluations $\left\{ \psi_i (x_j) \right\}_{j=1}^n$. Taking the inner product of both sides with $K(\cdot,x_j)$ for $j=1,\dots,n$ and stacking the results into a vector gives $$ \begin{align} n\Lambda_i \begin{pmatrix} \psi_i(x_1) \\ \vdots \\ \psi_i(x_n) \end{pmatrix} = n\Lambda_i \begin{pmatrix} \langle K(\cdot,x_1) ,\, \psi_i \rangle_\mathscr{H} \\ \vdots \\ \langle K(\cdot,x_n) ,\, \psi_i \rangle_\mathscr{H} \\ \end{pmatrix} &= \sum_{j=1}^n \left\{ \psi_i (x_j) \begin{pmatrix} \langle K(\cdot,x_1) ,\, K(\cdot,x_j) \rangle_\mathscr{H} \\ \vdots \\ \langle K(\cdot,x_n) ,\, K(\cdot,x_j) \rangle_\mathscr{H} \end{pmatrix}\right\} = K \begin{pmatrix} \psi_i(x_1) \\ \vdots \\ \psi_i(x_n) \end{pmatrix}, \end{align} $$ where $K$ is the kernel matrix. Denoting $v_i = (\psi_i(x_1), \, \dots ,\, \psi_i(x_n))^\top$, we have $$ \begin{align} K v_i = n\Lambda_i v_i, \end{align} $$ so $\lambda_i = n\Lambda_i$ is an eigenvalue of $K$, and $v_i$ is a corresponding eigenvector. For normalization purposes we could also take $e_i = \frac{1}{\sqrt{n}}v_i$ as the eigenvector (e.g. if the $\psi_i$ are normalized so that $\frac{1}{n}\sum_j \psi_i(x_j)^2 = 1$).
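A small numerical check of $K v_i = n\Lambda_i v_i$, as a sketch with a hypothetical explicit feature map $\phi(x) = (1, x, x^2)$, so that the "RKHS" is just $\mathbb{R}^3$ and the operator $\mathscr{K}$ becomes an ordinary $3\times 3$ matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
x = rng.normal(size=n)
Phi = np.stack([np.ones(n), x, x ** 2], axis=1)      # n x 3 feature matrix, rows phi(x_i)

C = Phi.T @ Phi / n                                  # empirical operator (1/n) sum_i phi(x_i) phi(x_i)^T
K = Phi @ Phi.T                                      # kernel (Gram) matrix, K_ij = <phi(x_i), phi(x_j)>

Lam, U = np.linalg.eigh(C)                           # operator eigenpairs (Lambda_i, u_i), ascending
lam = np.sort(np.linalg.eigvalsh(K))[::-1]           # matrix eigenvalues, descending

# Non-zero eigenvalues of K are n * Lambda_i ...
print(np.allclose(lam[:3], n * Lam[::-1]))           # -> True
# ... and v_i = (psi_i(x_1), ..., psi_i(x_n)) = Phi @ u_i is an eigenvector of K.
u = U[:, -1]                                         # top operator eigenvector
v = Phi @ u                                          # point evaluations psi(x_j) = <phi(x_j), u>
print(np.allclose(K @ v, n * Lam[-1] * v))           # -> True
```

Only three eigenvalues of $K$ are non-zero here, because the features span a 3-dimensional subspace; in the general RKHS setting the roles are reversed and it is the operator whose non-zero spectrum is capped at $n$, as noted below.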

Note that the RKHS $\mathscr{H}$ may be infinite-dimensional, but since the range of $\mathscr{K}$ lies in the (at most) $n$-dimensional subspace spanned by $\left\{K(\cdot,x_i)\right\}_{i=1}^n$, the operator has at most $n$ non-zero eigenvalues, matching the at most $n$ non-zero eigenvalues of $K$.

Lemmas 10 and 12 in Ghojogh et al. (2021), and references therein, give essentially the same result.
