Solved – Use the RBF kernel to construct a positive definite covariance matrix

covariance-matrix, gaussian-process, machine-learning

A commonly used kernel in Gaussian processes is the RBF kernel:

$$
\kappa(x,x') = \exp\left(-\frac{|| x-x'||^2}{2\sigma^2}\right)
$$

In the context of a Gaussian process, a kernel is used to construct a covariance matrix

$$
\Sigma_{ij} = \kappa(x_i,x_j)
$$

for $x_i$ and $x_j$ in a given space.

Covariance matrices have to be positive definite. However, when I implemented this I ran into the problem that the resulting matrix is not always positive definite:

xa = xb = Subdivide[-5, 5, 100];                 (* 101 evenly spaced points on [-5, 5] *)
m = Outer[Exp[-0.5 Norm[# - #2]^2] &, xa, xb];   (* Sigma_ij for the RBF kernel with sigma = 1 *)
PositiveDefiniteMatrixQ[m]
(* False *)

In this Mathematica code I first took 101 points uniformly from the range $[-5,5]$ and then evaluated $\Sigma_{ij}$ at these points for all $i, j$, using $\sigma = 1$ in the kernel. According to PositiveDefiniteMatrixQ, the resulting matrix $\Sigma$ is not positive definite.

I read here that such a covariance matrix is not guaranteed to be positive definite. I also read here that it can be fixed by adding a small constant to the diagonal of the covariance matrix, which seems to be correct.

xa = xb = Subdivide[-5, 5, 100];
m = Outer[Exp[-0.5 Norm[# - #2]^2] &, xa, xb];
m += DiagonalMatrix[ConstantArray[0.0001, 101]];   (* add a small constant to the diagonal *)
PositiveDefiniteMatrixQ[m]
(* True *)

My questions are the following:

  1. How is the covariance matrix usually created in practice? The RBF kernel is very widely used for Gaussian processes, so there should be a standard way of constructing a covariance matrix from it that is positive definite.
  2. Why does the trick of adding a small constant to the diagonal of the covariance matrix work?
  3. The second post I linked to says that the problem is that, since the RBF kernel is very smooth, there is a singularity for points that are very close to each other. Where does this singularity come from? I understand that the kernel values will be nearly identical when the points are close together in input space, but I don't see any singularity.

Best Answer

I think the comment given in the linked answer is incorrect, or might be a misunderstanding. In general, covariance matrices only need to be positive semi-definite. But the covariance matrix $\Sigma$ constructed the way you did from the RBF kernel is always strictly positive definite, as long as the points $x_i$ are distinct. This means $x^T \Sigma x > 0$ for all $x \neq 0$. This fact is crucial: if $\Sigma$ were only semi-definite, it would not be invertible, and you need the inverse to solve the prediction or interpolation equations in Gaussian process regression.
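
For reference (standard Gaussian process notation, not taken from your post): in noise-free GP interpolation the predictive mean at a test point $x_*$ is

$$
\bar{f}(x_*) = k_*^{T}\, \Sigma^{-1} \mathbf{y}, \qquad (k_*)_i = \kappa(x_*, x_i),
$$

so $\Sigma$ (or $\Sigma + \sigma_n^2 I$ in the noisy case) must be invertible.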

So, if $\Sigma$ is positive definite, why did you run into trouble? Most likely it is a numerical problem. The RBF kernel is notorious for being numerically unstable. The reason is the fast (exponential!) decay of its eigenvalues: very small eigenvalues lead to large condition numbers and hence to numerical problems. Explicit formulas for the eigenvalues and eigenfunctions of the RBF kernel can be found in Rasmussen & Williams, Section 4.3.1; in the preceding sections they also explain the connection between rapidly decaying eigenvalues and smoothness.
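
As a rough illustration (a sketch, reusing the grid from your code), you can inspect the spectrum of the matrix directly; the exact numbers depend on your Mathematica version, but the smallest eigenvalues typically sit at or below machine-precision noise:

xa = Subdivide[-5., 5., 100];                 (* the same 101-point grid *)
m = Outer[Exp[-0.5 (#1 - #2)^2] &, xa, xa];   (* RBF kernel with sigma = 1 *)
eigs = Eigenvalues[m];
ListLogPlot[Abs[eigs], PlotRange -> All]      (* roughly exponential decay *)
Min[eigs]                                     (* tiny; may even come out slightly negative *)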

I do not know which algorithm Mathematica uses to determine positive definiteness, but any algorithm working in finite-precision arithmetic is doomed to fail once the condition number of $\Sigma$ becomes too large. You can check this yourself by varying the number of points in your grid while monitoring the condition number or plotting the eigenvalues.
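
For example (a sketch; cond below is a throwaway helper, not a built-in), you can estimate the condition number from the singular values while refining the grid:

cond[n_] := Module[{x = Subdivide[-5., 5., n], k, s},
  k = Outer[Exp[-0.5 (#1 - #2)^2] &, x, x];
  s = SingularValueList[k, Tolerance -> 0];    (* keep all singular values *)
  First[s]/Last[s]]
Table[{n, cond[n]}, {n, {10, 20, 40, 80}}]     (* the ratio grows rapidly with n *)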

This also explains the widely used "trick" of adding a small $\epsilon$ to the diagonal. Adding $\epsilon I$ shifts every eigenvalue up by $\epsilon$, so the condition number of $\Sigma + \epsilon I$ will be (much) smaller than that of $\Sigma$ alone. Again, this is something you can easily check while varying the grid granularity.
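
A sketch of that check on your original grid, using singular values as a proxy for the eigenvalues (the matrix is symmetric positive semi-definite, so they coincide up to round-off):

xa = Subdivide[-5., 5., 100];
m = Outer[Exp[-0.5 (#1 - #2)^2] &, xa, xa];
eps = 10.^-4;
sv  = SingularValueList[m, Tolerance -> 0];
svJ = SingularValueList[m + eps IdentityMatrix[Length[xa]], Tolerance -> 0];
{First[sv]/Last[sv], First[svJ]/Last[svJ]}     (* the second ratio is capped near First[sv]/eps *)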

Other kernels are better behaved in this respect. If you really want to do interpolation, you should probably choose a different kernel; one example would be a kernel from the Matérn family (see Rasmussen & Williams, Eq. 4.14).
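
For instance (a sketch, assuming a Matérn-5/2 kernel with length scale 1, i.e. the $\nu = 5/2$ special case of that formula), the matrix built on the same grid is far better conditioned:

matern52[r_] := (1 + Sqrt[5.] r + 5. r^2/3) Exp[-Sqrt[5.] r]   (* nu = 5/2, length scale 1 *)
xa = Subdivide[-5., 5., 100];
mM = Outer[matern52[Abs[#1 - #2]] &, xa, xa];
PositiveDefiniteMatrixQ[mM]   (* expected to return True: the eigenvalues decay only polynomially *)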