Notation / setting
We are considering a GP regression model:
\begin{equation}
y_i = f(x_i) + \epsilon_i
\end{equation}
where $y_i\in \mathbb{R}$, $x_i \in \mathbb{R}^d$, and $f$ is a Gaussian process (whose realizations are functions $f:\mathbb{R}^d\rightarrow \mathbb{R}$),
\begin{equation}
f \sim \mathrm{GP}(m(x_i), \kappa(x_i,x_j)).
\end{equation}
$n$ datapoints $(y_1,x_1), (y_2,x_2), (y_3,x_3),\ldots, (y_n,x_n)$ are given. (I use $\kappa$ to distinguish the covariance function from the matrices $K(\cdot,\cdot)$ that contain values of $\kappa$ evaluated at certain points; the question denotes both by $K$.)
How to handle $d$-dimensional inputs
The question covers computing the posterior predictive distribution for a test point (or $p$ test points) in the case $d=1$ and asks how to extend this to the general case $d=2,3,\ldots$.
Answer: nothing changes; the formulas from the one-dimensional case work in exactly the same way here. Note that $m$ is then a function from $\mathbb{R}^d$ to $\mathbb{R}$ and $\kappa$ a function from $\mathbb{R}^d \times \mathbb{R}^d$ to $\mathbb{R}$.
So, for example, the matrix denoted by $K(x,x)$ in the question is an $n\times n$ matrix with $K(x,x)_{i,j} = \kappa(x_i, x_j)$ ($x_i$ and $x_j$ are $d$-dimensional, but since $\kappa$ maps two $d$-dimensional vectors to a scalar, $\kappa(x_i, x_j)$ is a scalar). Similarly for $K(x,x')$ and $K(x',x')$, where $x'$ are the test points.
Thus, the dimensions of the matrices in the predictive covariance equation are $(p\times p) - (p \times n)\,(n \times n)\,(n \times p)$, independent of whether the elements of the matrices are obtained by evaluating a function $\kappa(\cdot,\cdot)$ whose arguments are $1$-dimensional or a function $\kappa(\cdot, \cdot)$ whose arguments are $d$-dimensional. In fact, the inputs could even be in some space other than $\mathbb{R}^d$ (such as if we have a categorical predictor), as long as a positive-definite covariance function can be defined.
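To make the "nothing changes" point concrete, here is a minimal NumPy sketch (mine, not code from the question), assuming a zero prior mean and an SE kernel; the same two functions handle any input dimension $d$, and only the shape of the input arrays changes.

```python
import numpy as np

def se_kernel(A, B, lengthscale=1.0, sigma2=1.0):
    """Squared-exponential kernel matrix between the rows of A (m x d) and B (p x d)."""
    sq_dists = (np.sum(A**2, axis=1)[:, None]
                + np.sum(B**2, axis=1)[None, :]
                - 2.0 * A @ B.T)
    return sigma2 * np.exp(-0.5 * sq_dists / lengthscale**2)

def gp_posterior(X, y, X_star, lengthscale=1.0, sigma2=1.0, noise=1e-2):
    """Posterior predictive mean and covariance at test inputs X_star (p x d)."""
    K    = se_kernel(X, X, lengthscale, sigma2)            # n x n
    K_s  = se_kernel(X, X_star, lengthscale, sigma2)       # n x p
    K_ss = se_kernel(X_star, X_star, lengthscale, sigma2)  # p x p
    A = np.linalg.solve(K + noise * np.eye(len(X)), K_s)   # (K + sigma_n^2 I)^{-1} K_s
    mean = A.T @ y                                         # zero prior mean assumed
    cov  = K_ss - K_s.T @ A
    return mean, cov

# Works identically for d = 1, 2, 3, ...; only the shape of the inputs changes.
rng = np.random.default_rng(0)
X      = rng.uniform(size=(50, 3))                         # n = 50 points, d = 3
y      = np.sin(X.sum(axis=1)) + 0.1 * rng.normal(size=50)
X_star = rng.uniform(size=(5, 3))                          # p = 5 test points
mu, Sigma = gp_posterior(X, y, X_star)
print(mu.shape, Sigma.shape)                               # (5,) (5, 5)
```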
An extra remark about the SE kernel appearing in the question
The question mentions the SE kernel
\begin{equation}
k_{f}(x_{i},x_{k}) = \sigma^{2}\exp\!\Big(-\frac{1}{2\ell^{2}}\sum_{j=1}^{q}(x_{i,j}-x_{k,j})^{2}\Big)
\end{equation}
Note that this is already a function from $\mathbb{R}^q \times \mathbb{R}^q$ to $\mathbb{R}$ (with scalar inputs there would be no "$x_{i,j}$ and $x_{k,j}$" for different values of $j$), and $q$ should be $d$ if $d$ is the dimension of the inputs.
Optionally, the length scale $\ell$ could be made different for each input dimension, as $\ell_{j}$, in which case the factor $\frac{1}{2\ell_{j}^{2}}$ is instead placed inside the summation (the so-called ARD form of the kernel).
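As a small illustration (again a sketch, not code from the question), the per-dimension length-scale version just rescales each input dimension by its own $\ell_j$ before computing squared distances:

```python
import numpy as np

def se_kernel_ard(A, B, lengthscales, sigma2=1.0):
    """SE kernel with one length scale per input dimension (the ARD form).

    A: (m x d) array, B: (p x d) array, lengthscales: length-d array.
    """
    ell = np.asarray(lengthscales, dtype=float)   # shape (d,)
    As, Bs = A / ell, B / ell                     # rescale each dimension by its own ell_j
    sq_dists = (np.sum(As**2, axis=1)[:, None]
                + np.sum(Bs**2, axis=1)[None, :]
                - 2.0 * As @ Bs.T)
    return sigma2 * np.exp(-0.5 * sq_dists)
```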
A very useful project to grow intuition around Gaussian processes!
[I'll try to keep this intuitive rather than mathematical.]
The two problems are very much related. Firstly, the kernel you are using does not include any i.i.d. Gaussian noise term (as you noted), yet the data itself does have noise. That is not a good idea: you are telling your function both to be very smooth, with a long length scale, and to pass exactly through each point, and this will cause an issue.
The covariance matrix will have non-zero eigenvalues, but some of them will be very, very close to zero, and the finite computational precision of your computer starts to have an effect. This is known as numerical instability. There are a number of ways to get around it:
1) Add noise to the observations; that is to say, add $\sigma_n^2 I$ to the covariance matrix. The data clearly has noise, so this is actually a good thing!
2) Perform a low-rank approximation to your GP; that is to say, do an eigenvalue/eigenvector decomposition and clip all negligible eigenvalues. The covariance matrix is now low rank, and you can easily invert the non-zero eigenvalues, giving you a pseudo-inverse of the covariance matrix (see the sketch just below this list). Be warned, though, that your uncertainty will be essentially zero, as you will have only a few degrees of freedom and clearly many, many points.
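A minimal sketch of option 2 in plain NumPy (the clipping threshold is illustrative, not taken from any particular library):

```python
import numpy as np

def clipped_pseudo_inverse(K, rel_tol=1e-10):
    """Pseudo-inverse of a symmetric PSD kernel matrix K by clipping tiny eigenvalues."""
    eigvals, eigvecs = np.linalg.eigh(K)          # K = V diag(eigvals) V^T
    keep = eigvals > rel_tol * eigvals.max()      # drop numerically negligible eigenvalues
    inv_vals = np.zeros_like(eigvals)
    inv_vals[keep] = 1.0 / eigvals[keep]
    return eigvecs @ np.diag(inv_vals) @ eigvecs.T  # low-rank pseudo-inverse of K
```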
I believe the GPML solution you mentioned is another reasonable approach, but hopefully you get the idea. The addition of $I\epsilon$ is very popular and essentially adds a small bit of noise until the matrix becomes well conditioned. This is how the GPs in GPy are implemented, for example.
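A rough sketch of that "add $I\epsilon$ until the matrix is well conditioned" idea (the constants and loop structure here are illustrative, not GPy's actual code):

```python
import numpy as np

def cholesky_with_jitter(K, max_tries=6):
    """Try a Cholesky factorization, adding increasing diagonal jitter on failure."""
    jitter = 1e-8 * np.mean(np.diag(K))
    for _ in range(max_tries):
        try:
            return np.linalg.cholesky(K + jitter * np.eye(K.shape[0])), jitter
        except np.linalg.LinAlgError:
            jitter *= 10.0   # escalate the jitter and try again
    raise np.linalg.LinAlgError("matrix not positive definite, even with jitter")
```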
Best Answer
Found the answer after much searching. Ultimately, it comes down to the Loewner ordering of $K_{**}$ and $K_{*f}(K_{ff} + \sigma^2I)^{-1}K_{f*}$, the latter of which (the product of all those kernel matrices) I will denote $K_{*fff*}$. Essentially, the question is: is the matrix resulting from the subtraction $K_{**} - K_{*fff*}$ PSD? This will only be true when $K_{**} \succeq K_{*fff*}$.
We know that $K_{**}$ is PSD if it came from a valid kernel, which for our purposes we say it has. $K_{*fff*}$ is also PSD: since $K_{f*} = K_{*f}^{\mathsf T}$, it has the form $A B A^{\mathsf T}$ with $B = (K_{ff} + \sigma^2I)^{-1}$ symmetric PSD, and any matrix of that form is PSD (as long as the underlying matrices came from valid kernels, i.e. kernels that satisfy Mercer's theorem). Therefore, we have two PSD matrices undergoing a subtraction operation; when will this result be PSD?
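A one-line check of that claim (standard linear algebra, not from the original post): for any vector $x$, writing $z = K_{f*}x$,
\begin{equation}
x^{\mathsf T} K_{*f}\,(K_{ff}+\sigma^2 I)^{-1}\,K_{f*}\,x = z^{\mathsf T}(K_{ff}+\sigma^2 I)^{-1} z \ge 0,
\end{equation}
since the inverse of a symmetric positive-definite matrix is itself positive definite.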
There are three outcomes that can occur from this subtraction: the difference can be PSD, negative semi-definite, or indefinite (negative for some vectors $x$ and positive for others).
If a matrix $A$ is PSD, then $x^{\mathsf T}Ax \ge 0$ for all real vectors $x$. We can use variables to represent the vector $x$, so for the 2D case we have something like
$\begin{bmatrix} x & y \end{bmatrix} \begin{bmatrix} a & b\\ b & c \end{bmatrix} \begin{bmatrix} x\\ y \end{bmatrix} $,
which we can expand to the quadratic $ax^2 + bxy + bxy + cy^2 = ax^2 + 2bxy + cy^2$. The squared terms are non-negative (provided $a, c \ge 0$), but the middle $2bxy$ term can be negative; what determines whether matrix $A$ is PSD is whether the squared terms always outweigh whatever negative contribution the middle term makes (for a symmetric $2\times 2$ matrix this amounts to $a \ge 0$, $c \ge 0$, and $ac - b^2 \ge 0$).
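As a concrete example (mine, not from the original post), take $a = c = 1$ and $b = 2$ and evaluate the quadratic at $x = 1$, $y = -1$:
\begin{equation}
\begin{bmatrix} 1 & -1 \end{bmatrix}
\begin{bmatrix} 1 & 2\\ 2 & 1 \end{bmatrix}
\begin{bmatrix} 1\\ -1 \end{bmatrix}
= 1 - 2 - 2 + 1 = -2 < 0,
\end{equation}
so this matrix is not PSD even though both squared terms have positive coefficients (indeed $ac - b^2 = 1 - 4 < 0$).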
Using these ideas, we can determine whether $K_{**} - K_{*fff*}$ will be PSD by using these quadratics: $x^{\mathsf T}K_{**}x - x^{\mathsf T}K_{*fff*}x \ge 0$, or equivalently $x^{\mathsf T}K_{**}x \ge x^{\mathsf T}K_{*fff*}x$, for all real vectors $x$. That means $K_{**}$'s quadratic in the 2D case must always be greater than or equal to the $K_{*fff*}$ quadratic. We can visualize this in 2D with this image.
Note that in this plot, the tan bowl-shaped surface in the center (i.e. the smallest one in terms of diameter) is the $K_{**}$ quadratic. The outer red one is the $K_{*fff*}$ quadratic, and the teal bowl is the difference between them (the covariance quadratic). Note that this difference corresponds to a valid covariance matrix because it is PSD. If the $K_{**}$ quadratic is not everywhere at least as large as the $K_{*fff*}$ quadratic, then the next image shows what happens after the subtraction.
Here $K_{**}$ is not everywhere larger, and we see that the resulting covariance quadratic dips below zero in those areas where $K_{**}$ is not greater than $K_{*fff*}$. This occurred when the length scale used to compute $K_{**}$ was not the same as the length scales used to compute the resulting $K_{*fff*}$ matrix. We can see in this image that the eigenvalues of the $K_{**}$ matrix are not greater than the eigenvalues of the $K_{*fff*}$ matrix in each of the eigenvector directions.
This begs the question: can we use different length scales (or potentially different kernels, though I haven't tried that) for $K_{**}$ and $K_{*fff*}$ and still get a PSD matrix? The answer is yes, as long as $K_{**}$'s quadratic is everywhere greater than or equal to $K_{*fff*}$'s quadratic. Using a variation of the RBF equation
$K(x,x') = \sigma_f e^{-l \lVert x-x'\rVert}$
Note that the usual $\frac{1}{l^2}$ is just being absorbed into the $l$ term; the two parameterizations are equivalent. Using this RBF kernel function, I can give the $K_{ff}$ and $K_{*f}$ kernels the same length scale of 2.338620173077578 and the $K_{**}$ kernel a length scale of 3.3949857315049945, and the resulting covariance is a valid PSD matrix, shown in the image below. This is an example of a valid covariance matrix that came from RBF kernels with differing length scales. The reason this can be useful is the potential need to describe the similarity between points differently depending on whether you are looking at the test set or the training set. I am still unaware of how to choose the length scales purposefully; so far I have only been able to find valid PSD matrices by a random search over length scales. Future work will be devoted to finding a way of choosing differing length scales while knowing the resulting covariance matrix will be valid.
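For anyone who wants to reproduce this kind of experiment, here is a small sketch of the check I mean (my code, not the original poster's; the data is random, and whether the difference comes out PSD depends on the particular points, noise level, and length scales):

```python
import numpy as np

def exp_kernel(A, B, l, sigma_f=1.0):
    """K(x, x') = sigma_f * exp(-l * ||x - x'||), the kernel variant used above."""
    dists = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)
    return sigma_f * np.exp(-l * dists)

rng = np.random.default_rng(0)
X  = rng.uniform(size=(30, 1))        # training inputs
Xs = rng.uniform(size=(10, 1))        # test inputs
sigma2_n = 1e-6                       # observation noise

l_train, l_test = 2.338620173077578, 3.3949857315049945
K_ff = exp_kernel(X, X, l_train)
K_sf = exp_kernel(Xs, X, l_train)
K_ss = exp_kernel(Xs, Xs, l_test)     # different length scale for the test block

K_sffs = K_sf @ np.linalg.solve(K_ff + sigma2_n * np.eye(len(X)), K_sf.T)
diff = K_ss - K_sffs
print("smallest eigenvalue:", np.linalg.eigvalsh(diff).min())  # >= 0 (up to round-off) <=> PSD
```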