Solved – Ill-conditioned covariance matrix in GP regression for Bayesian optimization

bayesian-optimization, covariance-matrix, gaussian-process, regression

Background and problem

I am using Gaussian Processes (GP) for regression and subsequent Bayesian optimization (BO). For regression I use the gpml package for MATLAB with several custom-made modifications, but the problem is general.

It is well known that when two training inputs are too close in input space, the covariance matrix becomes ill-conditioned and may numerically lose positive definiteness (there are several questions about this on this site). As a result, the Cholesky decomposition of the covariance matrix, which is needed for various GP computations, may fail due to numerical error. This has happened to me in several cases when performing BO with the objective functions I am using, and I'd like to fix it.
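
To see the failure mode concretely, here is a toy MATLAB example (not gpml code; the kernel form and inputs are made up) where two nearly identical inputs make a squared-exponential covariance matrix numerically singular, so chol reports failure:

    % Toy illustration: two almost-duplicate inputs under a squared-exponential kernel
    x  = [0; 1e-9];                 % two nearly identical training inputs
    sf = 1; ell = 1;                % assumed signal variance and length scale
    K  = sf * exp(-(x - x').^2 / ell^2);
    fprintf('condition number: %g\n', cond(K));
    [~, p] = chol(K);               % p > 0 signals that chol failed
    if p > 0
        disp('Cholesky factorization failed: K is numerically not positive definite');
    end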

Proposed solutions

AFAIK, the standard solution to alleviate ill-conditioning is to add a ridge or nugget to the diagonal of the covariance matrix. For GP regression, this amounts to adding (or increasing, if already present) observation noise.
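
For concreteness, a minimal MATLAB sketch of the nugget idea (not the actual gpml modification; the toy inputs, hyperparameters, and escalation schedule are purely illustrative):

    % Add a small multiple of the identity (extra observation noise) and
    % escalate it until the Cholesky factorization succeeds.
    x  = sort([rand(20, 1); 0.5; 0.5 + 1e-12]);   % toy inputs with a near-duplicate
    sf = 1; ell = 0.5;                            % assumed hyperparameters
    K  = sf * exp(-(x - x').^2 / ell^2);

    jitter = 1e-12 * mean(diag(K));
    [L, p] = chol(K + jitter * eye(numel(x)), 'lower');
    while p > 0                                   % p > 0 means chol failed
        jitter = 10 * jitter;                     % escalate the nugget
        [L, p] = chol(K + jitter * eye(numel(x)), 'lower');
    end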

So far so good. I modified the exact-inference code of gpml so that whenever the Cholesky decomposition fails, I replace the covariance matrix with the closest symmetric positive definite (SPD) matrix in Frobenius norm, inspired by this MATLAB code by John d'Errico. The rationale is to intervene as little as possible on the original matrix.
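
For reference, a rough sketch of one way to implement such a fallback (a simple eigenvalue-clipping variant; this is not d'Errico's nearestSPD itself, and the function name is made up):

    % Project a nearly-SPD covariance matrix back onto the SPD cone.
    % Would live in nearest_spd_sketch.m; used only when chol fails.
    function A = nearest_spd_sketch(K)
        B = (K + K') / 2;                  % symmetrize first
        [V, D] = eig(B);
        d = diag(D);
        d(d < eps(max(d))) = eps(max(d));  % clip negative/tiny eigenvalues
        A = V * diag(d) * V';
        A = (A + A') / 2;                  % re-symmetrize against round-off
    end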

This workaround does the job, but I noticed that the performance of BO degraded substantially for some functions, possibly whenever the algorithm needs to zoom in on some area (e.g., because it is getting near the minimum, or because the length scales of the problem become non-uniformly small). This behaviour makes sense, since I am effectively increasing the noise whenever two input points get too close, but of course it is not ideal. Alternatively, I could just remove problematic points, but again, sometimes I need the input points to be close.

Question

I don't think that numerical issues with the Cholesky factorization of GP covariance matrices are a novel problem, but to my surprise I couldn't find many solutions so far, aside from increasing the noise or removing points that are too close to each other. On the other hand, it is true that some of my functions are pretty badly behaved, so perhaps my situation is not so typical.

Any suggestion/reference that could be useful here?

Best Answer

Another option is to essentially average out the points causing the issues; for example, if you have 1000 points and 50 cause issues, you could take the optimal low-rank approximation using the first 950 eigenvalues/vectors. However, this isn't far off from removing the data points that are close together, which you said you would rather not do. Please bear in mind, though, that as you add jitter you reduce the degrees of freedom, i.e., each point influences your prediction less, so this could be worse than using fewer points.
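
A hedged sketch of that truncation in MATLAB (toy data; the sizes and variable names are purely illustrative):

    % Keep the leading 950 eigenpairs of K and rebuild the best rank-950
    % approximation in Frobenius norm.
    x  = sort(rand(1000, 1));                        % toy inputs
    K  = exp(-(x - x').^2 / 0.5^2);                  % squared-exponential kernel

    [V, D]   = eig((K + K') / 2);                    % symmetric eigendecomposition
    [d, idx] = sort(diag(D), 'descend');
    r  = 950;                                        % keep the first 950 eigenpairs
    Kr = V(:, idx(1:r)) * diag(d(1:r)) * V(:, idx(1:r))';
    Kr = (Kr + Kr') / 2;                             % guard against round-off asymmetry
    % Note: Kr has rank r, so a small nugget is still needed before a Cholesky.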

Another option (which I personally think is neat) is to combine the two points in a slightly smarter way. You could, for instance, take two points and combine them into one, but also use them to determine an approximation of the gradient. To include gradient information, all you need from your kernel is to compute $\frac{dk(x,x')}{dx}$ and $\frac{d^2k(x,x')}{dx\,dx'}$. A derivative observation is (for stationary kernels like the Gaussian) uncorrelated with the function observation at the same point, so you don't run into conditioning issues and you retain the local information.

Edit:

Based on the comments, I thought I would elaborate on what I meant by including derivative observations. If we use a Gaussian kernel (as an example),

$k_{x,x'} = k(x, x') = \sigma\exp(-\frac{(x-x')^2}{l^2})$

its derivatives are,

$k_{dx,x'} =\frac{dk(x, x')}{dx} = - \frac{2(x-x')}{l^2} \sigma\exp(-\frac{(x-x')^2}{l^2})$

$k_{dx,dx'} =\frac{d^2k(x, x')}{dx\,dx'} = 2 \frac{l^2 - 2(x-x')^2}{l^4} \sigma\exp(-\frac{(x-x')^2}{l^2})$

Now, let us assume we have data points $\{x_i, y_i ;\ i = 1,\dots,n \}$ and a derivative observation at $x_1$, which I'll call $m_1$.

Let $Y = [m_1, y_1, \dots, y_n]$; then we use a single standard GP with covariance matrix

$K = \left( \begin{array}{cccc} k_{dx_1,dx_1} & k_{dx_1,x_1} & \dots & k_{dx_1,x_n} \\ k_{dx_1,x_1} & k_{x_1,x_1} & \dots & k_{x_1,x_n} \\ \vdots & \vdots & \ddots & \vdots \\ k_{dx_1,x_n} & k_{x_1,x_n} & \dots & k_{x_n,x_n} \end{array} \right)$

The rest of the GP is the same as usual.
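
As a hedged illustration (not gpml code; hyperparameters, inputs, and variable names are assumed), the matrix above could be assembled like this for the Gaussian kernel:

    % Joint covariance for one derivative observation m1 at x1 plus function
    % observations y1..yn, using the kernel and derivatives given above.
    sigma = 1; l = 0.5;                                        % assumed hyperparameters
    k   = @(a, b) sigma * exp(-(a - b).^2 / l^2);              % k(x, x')
    kd  = @(a, b) -2 * (a - b) / l^2 .* k(a, b);               % dk/dx
    kdd = @(a, b) 2 * (l^2 - 2 * (a - b).^2) / l^4 .* k(a, b); % d^2k/dx dx'

    x = [0.1; 0.4; 0.7];                 % training inputs; the derivative is at x(1)
    n = numel(x);

    K = zeros(n + 1);
    K(1, 1)         = kdd(x(1), x(1));   % cov(derivative, derivative)
    K(1, 2:end)     = kd(x(1), x)';      % cov(derivative, function values)
    K(2:end, 1)     = K(1, 2:end)';
    K(2:end, 2:end) = k(x, x');          % ordinary kernel block

    % With Y = [m1; y1; ...; yn], inference proceeds exactly as in a standard GP.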
