Solved – Why does a Gaussian Process need to have a PSD kernel? Can I use a non-PSD kernel?

covariance-matrix, gaussian-process, kernel-trick, machine-learning, svm

Is there an absolute need to use a PSD kernel for Gaussian processes (and maybe for SVMs)?

For example, if I used a Minkowski distance with 0 < p < 1, the function is not convex, and thus I would assume that the matrix K would not be positive semidefinite.
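(To make the concern concrete: for p < 1 the Minkowski "distance" is only a quasi-norm and already violates the triangle inequality. A minimal sketch in Python, with points chosen purely for illustration:)

```python
import numpy as np

def minkowski(x, y, p):
    """Minkowski distance; only a true metric for p >= 1."""
    return np.sum(np.abs(x - y) ** p) ** (1.0 / p)

x, y, z = np.array([0.0, 0.0]), np.array([1.0, 1.0]), np.array([1.0, 0.0])
p = 0.5
d_xy = minkowski(x, y, p)                           # (1 + 1)^2 = 4
d_detour = minkowski(x, z, p) + minkowski(z, y, p)  # 1 + 1 = 2
print(d_xy > d_detour)  # True: the direct path is "longer" than the detour
```

(Note that `scipy.spatial.distance.minkowski` refuses p < 1 for exactly this reason, so the distance is computed by hand here.)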

In Gaussian Process regression, we would need to invert the covariance matrix to do inference. What would it mean to have negative eigenvalues in terms of this inference?
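(One concrete numerical symptom, as an illustration of the question: GP implementations typically factor the covariance matrix with a Cholesky decomposition rather than inverting it explicitly, and that factorization fails outright when the matrix has a negative eigenvalue. The toy matrix below is my own example:)

```python
import numpy as np

# A symmetric "covariance" matrix with eigenvalues 3 and -1, so not PSD.
K = np.array([[1.0, 2.0],
              [2.0, 1.0]])

try:
    np.linalg.cholesky(K)  # GP codes factor K (or K + sigma^2 I) this way
except np.linalg.LinAlgError as err:
    print("Cholesky failed:", err)
```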

Are there any corrections that can be done so that I can use a Minkowski distance function with 0 < p < 1?

Best Answer

Say that $X \sim \mathcal{GP}(m(\cdot), k(\cdot, \cdot))$.

If $k$ is not a PSD kernel, then there is some set of $n$ points $\{ t_i \}_{i=1}^n$ and corresponding weights $\alpha_i \in \mathbb R$ such that $$\sum_{i=1}^n \sum_{j=1}^n \alpha_i k(t_i, t_j) \alpha_j < 0.$$

Now, consider the joint distribution of $\big(X(t_i) \big)$. By the GP assumption, $$\operatorname{Cov}(X(t_i), X(t_j)) = k(t_i, t_j).$$ But then $$ \operatorname{Var}\left( \sum_{i=1}^n \alpha_i X(t_i) \right) = \sum_{i=1}^n \sum_{j=1}^n \alpha_i \operatorname{Cov}(X(t_i), X(t_j)) \alpha_j < 0 ,$$ which is nonsensical. So a GP with a non-PSD kernel isn't a valid random process. How much that matters to your application depends on what you're doing with it, but the probabilistic foundations are definitely shot.
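The argument above is easy to see numerically. A toy sketch (the matrix is my own illustration, nothing specific to Minkowski kernels): take a symmetric "covariance" matrix with a negative eigenvalue and plug the offending eigenvector in as the weights $\alpha$:

```python
import numpy as np

# Symmetric but indefinite: eigenvalues are 3 and -1.
K = np.array([[1.0, 2.0],
              [2.0, 1.0]])

alpha = np.array([1.0, -1.0])  # eigenvector for the eigenvalue -1
var = alpha @ K @ alpha        # "Var" of sum_i alpha_i X(t_i)
print(var)  # -2.0: a negative "variance", which no random vector can have
```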

If you're just running an SVM or doing ridge regression (formally equivalent to GP regression), then the Hilbert space foundations are also definitely shot, but it's possible to define what you're doing in terms of a Krein space. There was a bit of work in the mid-to-late-aughts on this, but I think it was mostly abandoned because the theoretical motivation wasn't super satisfying and neither were the empirical results; I can dig out some of these papers if you want.

Another option is to "patch" your kernel to the closest (in some sense) PSD kernel on the particular points you consider. The following paper studied that; I have also used these techniques, but wouldn't generally recommend them if you can avoid it, because it adds a lot of headaches.

Chen, Garcia, Gupta, Rahimi, and Cazzanti. Similarity-based Classification: Concepts and Algorithms. JMLR 2009 (pdf).
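For concreteness, the simplest of these patches clips the spectrum: replace the indefinite Gram matrix by its nearest PSD matrix in Frobenius norm, obtained by zeroing out the negative eigenvalues. A minimal sketch (function name my own):

```python
import numpy as np

def clip_to_psd(K):
    """Project a symmetric matrix onto the PSD cone (Frobenius-nearest)."""
    K = (K + K.T) / 2                  # enforce exact symmetry
    w, V = np.linalg.eigh(K)           # eigendecomposition K = V diag(w) V^T
    w = np.clip(w, 0.0, None)          # zero out negative eigenvalues
    return (V * w) @ V.T               # reassemble with the clipped spectrum

K = np.array([[1.0, 2.0],
              [2.0, 1.0]])             # eigenvalues 3 and -1: not PSD
K_psd = clip_to_psd(K)
print(np.linalg.eigvalsh(K_psd).min() >= -1e-12)  # True: now PSD
```

One of the headaches alluded to above: the clipped matrix is tied to the particular points you trained on, so there is no principled way to evaluate the "patched kernel" at new test points.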
