Gaussian Process – Why Does a Valid Kernel Only Have to Be Positive Semi-Definite?

gaussian process, kernel trick, matrix inverse

I'm currently studying Gaussian processes. To compute the covariance matrix of the conditional (posterior) distribution, we have to compute $(K_{XX})^{-1}$, i.e. invert $K_{XX}$, where $K_{XX}$ is the matrix of a kernel function $k(x_i, x_j)$ evaluated on all pairs of training samples.

For a (symmetric) matrix to be guaranteed invertible, it has to be positive definite (pd); positive semi-definiteness alone is not enough.

However, checking the validity of a kernel is often based on Mercer's theorem, which only seems to ensure positive semi-definiteness (psd).
https://las.inf.ethz.ch/courses/introml-s20/tutorials/tutorial5-kernels2.pdf (ETH Zürich, Slide 13).
https://people.eecs.berkeley.edu/~jordan/courses/281B-spring04/lectures/lec3.pdf (Berkeley, Page 4).

But positive semi-definite kernels produce positive semi-definite matrices.

Could someone explain where my misconception is?
Are there valid kernels for which the kernel matrix is psd but not invertible, so that the Gaussian process computation fails?
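To make the concern concrete, here is a minimal noise-free GP regression sketch in NumPy (my own illustration; the linear kernel $k(x,y)=xy$ and the toy data are assumed placeholder choices) showing where the inverse of $K_{XX}$ enters and how the computation can break down when the kernel matrix is only psd:

```python
# Minimal noise-free GP posterior sketch (illustrative only).
import numpy as np

def linear_kernel(A, B):
    # k(x, y) = x * y, evaluated for all pairs of rows of A and B
    return A @ B.T

X = np.array([[1.0], [2.0]])   # training inputs
y = np.array([1.0, 2.0])       # training targets
Xs = np.array([[1.5]])         # test input

K_XX = linear_kernel(X, X)     # the matrix whose inverse we need
K_sX = linear_kernel(Xs, X)
K_ss = linear_kernel(Xs, Xs)

try:
    # Posterior of the noise-free conditional:
    #   mean = K_sX K_XX^{-1} y,   cov = K_ss - K_sX K_XX^{-1} K_Xs
    mean = K_sX @ np.linalg.solve(K_XX, y)
    cov = K_ss - K_sX @ np.linalg.solve(K_XX, K_sX.T)
except np.linalg.LinAlgError:
    # Here K_XX = [[1, 2], [2, 4]] is singular (only psd), so the
    # solve fails -- exactly the situation asked about.
    print("K_XX is singular; the plain conditional formula breaks down.")
```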

Edit: I know that the kernels we classically use are pd, as listed here: https://en.wikipedia.org/wiki/Positive-definite_kernel.
This seems to be related to the unanswered post "Is a valid kernel function have to be positive definite?"

Best Answer

You have not told us what you consider a valid kernel, so I can't comment on that. But I can give you a reason why one prefers to consider positive semi-definite kernels rather than insisting on positive definite ones. Consider the following kernel:

$$ K:\mathbb{R}\times\mathbb{R}\rightarrow\mathbb{R}\quad,\quad K(x,y)=xy.$$

This may be just the canonical inner product on $\mathbb{R}$, but it defines a perfectly good real reproducing kernel Hilbert space (RKHS), and $K$ is its reproducing kernel. The kernel even has a name: the linear kernel. But notice that the kernel matrix for the points $x=1$ and $y=2$ is $$ \begin{pmatrix} 1 & 2\\2 & 4\end{pmatrix},$$ which is singular and hence only positive semi-definite.
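A quick numerical check (a small NumPy sketch added for illustration, not part of the original answer) confirms this:

```python
import numpy as np

K = np.array([[1.0, 2.0],
              [2.0, 4.0]])    # kernel matrix for x = 1, y = 2

print(np.linalg.det(K))       # 0 (up to rounding): the matrix is singular
print(np.linalg.eigvalsh(K))  # eigenvalues [0., 5.]: psd but not pd
```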

This is no accident. Consider the canonical feature map of this RKHS: the map which associates with a point $x\in\mathbb{R}$ the function (= vector in the RKHS) $k_x$ with the property that $k_x(z)=K(x,z)=xz$. It is not difficult to see that for any two points $x,y$ the respective functions $k_x$ and $k_y$ are scalar multiples of each other. Since the canonical features span the RKHS, you can conclude that the feature space is one-dimensional, and in a one-dimensional space every Gram matrix (= kernel matrix) has rank at most one.

This generalizes to feature spaces of arbitrary but finite dimension with the kernel defined on an infinite set: if your feature space is $n$-dimensional, pick $n+1$ points; the corresponding feature vectors will be linearly dependent and their Gram/kernel matrix only positive semi-definite (singular).
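To make this concrete, here is a small NumPy sketch (my own illustration, not from the answer; the quadratic polynomial kernel $k(x,y)=(xy+1)^2$ is an assumed example). Its feature map on $\mathbb{R}$ is $\varphi(x)=(x^2,\sqrt{2}\,x,1)$, so the feature space is three-dimensional and any four points already give a singular kernel matrix:

```python
import numpy as np

def poly_kernel(x, y, degree=2):
    # k(x, y) = (x*y + 1)^degree; on R the feature space has dimension degree + 1
    return (np.outer(x, y) + 1.0) ** degree

x = np.array([0.5, 1.0, 2.0, 3.0])   # 4 points, but the feature space is 3-dim
K = poly_kernel(x, x)

print(np.linalg.matrix_rank(K))      # 3 (< 4): K is singular up to rounding
print(np.linalg.eigvalsh(K))         # smallest eigenvalue is (numerically) 0
```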

To conclude, insisting on positive definite kernels would exclude finite-dimensional RKHSs, which does not make a lot of sense.

And of course, if you want all kernel matrices on an infinite domain of definition to be invertible, you need a kernel whose feature space is infinite-dimensional.
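As a sanity check (again my own sketch, not part of the answer): the Gaussian/RBF kernel $k(x,y)=\exp\!\big(-(x-y)^2/(2\ell^2)\big)$ has an infinite-dimensional feature space and is strictly positive definite on distinct points, so its kernel matrices are invertible; a successful Cholesky factorization certifies this numerically:

```python
import numpy as np

def rbf_kernel(x, y, lengthscale=1.0):
    # k(x, y) = exp(-(x - y)^2 / (2 * lengthscale^2)),
    # a kernel with an infinite-dimensional feature space
    d = np.subtract.outer(x, y)
    return np.exp(-0.5 * (d / lengthscale) ** 2)

x = np.array([0.5, 1.0, 2.0, 3.0])   # any distinct points
K = rbf_kernel(x, x)

# Cholesky succeeds only for (numerically) positive definite matrices,
# so this confirms K is invertible for these distinct inputs.
L = np.linalg.cholesky(K)
print(np.linalg.eigvalsh(K))         # all eigenvalues strictly > 0
```

That said, such matrices can still be numerically ill-conditioned when points lie close together, which is why GP implementations commonly add a small noise or "jitter" term $\sigma^2 I$ to $K_{XX}$ before factorizing, making the matrix strictly positive definite regardless of the kernel.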
