Solved – Ill-conditioned covariance matrix in GP regression for Bayesian optimization

bayesian-optimization, covariance-matrix, gaussian-process, regression

Background and problem

I am using Gaussian Processes (GP) for regression and subsequent Bayesian optimization (BO). For regression I use the gpml package for MATLAB with several custom-made modifications, but the problem is general.

It is well known that when two training inputs are too close in input space, the covariance matrix becomes ill-conditioned and may numerically lose positive definiteness (there are several questions about this on this site). As a result, the Cholesky decomposition of the covariance matrix, which is needed for various GP computations, may fail due to numerical error. This has happened to me in several cases when performing BO with the objective functions I am using, and I'd like to fix it.
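
To see the failure mode concretely, here is a toy MATLAB example (not gpml code; the kernel form and inputs are made up) where two nearly identical inputs make a squared-exponential covariance matrix numerically singular, so chol reports failure:

    % Toy illustration: two almost-duplicate inputs under a squared-exponential kernel
    x  = [0; 1e-9];                 % two nearly identical training inputs
    sf = 1; ell = 1;                % assumed signal variance and length scale
    K  = sf * exp(-(x - x').^2 / ell^2);
    fprintf('condition number: %g\n', cond(K));
    [~, p] = chol(K);               % p > 0 signals that chol failed
    if p > 0
        disp('Cholesky factorization failed: K is numerically not positive definite');
    end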

Proposed solutions

AFAIK, the standard solution to alleviate ill-conditioning is to add a ridge or nugget to the diagonal of the covariance matrix. For GP regression, this amounts to adding (or increasing, if already present) observation noise.
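
For concreteness, a minimal MATLAB sketch of the nugget idea (not the actual gpml modification; the toy inputs, hyperparameters, and escalation schedule are purely illustrative):

    % Add a small multiple of the identity (extra observation noise) and
    % escalate it until the Cholesky factorization succeeds.
    x  = sort([rand(20, 1); 0.5; 0.5 + 1e-12]);   % toy inputs with a near-duplicate
    sf = 1; ell = 0.5;                            % assumed hyperparameters
    K  = sf * exp(-(x - x').^2 / ell^2);

    jitter = 1e-12 * mean(diag(K));
    [L, p] = chol(K + jitter * eye(numel(x)), 'lower');
    while p > 0                                   % p > 0 means chol failed
        jitter = 10 * jitter;                     % escalate the nugget
        [L, p] = chol(K + jitter * eye(numel(x)), 'lower');
    end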

So far so good. I modified the exact-inference code of gpml so that whenever the Cholesky decomposition fails, I replace the covariance matrix with the closest symmetric positive definite (SPD) matrix in Frobenius norm, inspired by this MATLAB code by John d'Errico. The rationale is to intervene as little as possible on the original matrix.
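
For reference, a rough sketch of one way to implement such a fallback (a simple eigenvalue-clipping variant; this is not d'Errico's nearestSPD itself, and the function name is made up):

    % Project a nearly-SPD covariance matrix back onto the SPD cone.
    % Would live in nearest_spd_sketch.m; used only when chol fails.
    function A = nearest_spd_sketch(K)
        B = (K + K') / 2;                  % symmetrize first
        [V, D] = eig(B);
        d = diag(D);
        d(d < eps(max(d))) = eps(max(d));  % clip negative/tiny eigenvalues
        A = V * diag(d) * V';
        A = (A + A') / 2;                  % re-symmetrize against round-off
    end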

This workaround does the job, but I noticed that the performance of BO degraded substantially for some functions, possibly whenever the algorithm needs to zoom in on some area (e.g., because it is getting near the minimum, or because the length scales of the problem become non-uniformly small). This behaviour makes sense, since I am effectively increasing the noise whenever two input points get too close, but of course it is not ideal. Alternatively, I could just remove problematic points, but again, sometimes I need the input points to be close.

Question

I don't think that numerical issues with the Cholesky factorization of GP covariance matrices are a novel problem, but to my surprise I couldn't find many solutions so far, aside from increasing the noise or removing points that are too close to each other. On the other hand, it is true that some of my functions are pretty badly behaved, so perhaps my situation is not so typical.

Any suggestion/reference that could be useful here?

Best Answer

Another option is to essentially average out the points causing the issues; for example, if you have 1000 points and 50 cause issues, you could take the optimal low-rank approximation using the first 950 eigenvalues/vectors. However, this isn't far off from removing the data points that are close together, which you said you would rather not do. Please bear in mind, though, that as you add jitter you reduce the degrees of freedom, i.e., each point influences your prediction less, so this could be worse than using fewer points.
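
A hedged sketch of that truncation in MATLAB (toy data; the sizes and variable names are purely illustrative):

    % Keep the leading 950 eigenpairs of K and rebuild the best rank-950
    % approximation in Frobenius norm.
    x  = sort(rand(1000, 1));                        % toy inputs
    K  = exp(-(x - x').^2 / 0.5^2);                  % squared-exponential kernel

    [V, D]   = eig((K + K') / 2);                    % symmetric eigendecomposition
    [d, idx] = sort(diag(D), 'descend');
    r  = 950;                                        % keep the first 950 eigenpairs
    Kr = V(:, idx(1:r)) * diag(d(1:r)) * V(:, idx(1:r))';
    Kr = (Kr + Kr') / 2;                             % guard against round-off asymmetry
    % Note: Kr has rank r, so a small nugget is still needed before a Cholesky.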

Another option (which I personally think is neat) is to combine the two points in a slightly smarter way. You could, for instance, take two points and combine them into one, but also use them to determine an approximation of the gradient. To include gradient information, all you need from your kernel is to compute $\frac{dk(x,x')}{dx}$ and $\frac{d^2k(x,x')}{dx\,dx'}$. A derivative observation is (for stationary kernels like the Gaussian) uncorrelated with the function observation at the same point, so you don't run into conditioning issues and you retain the local information.

Edit:

Based on the comments, I thought I would elaborate on what I meant by including derivative observations. If we use a Gaussian kernel (as an example),

$k_{x,x'} = k(x, x') = \sigma\exp(-\frac{(x-x')^2}{l^2})$

its derivatives are,

$k_{dx,x'} =\frac{dk(x, x')}{dx} = - \frac{2(x-x')}{l^2} \sigma\exp(-\frac{(x-x')^2}{l^2})$

$k_{dx,dx'} =\frac{d^2k(x, x')}{dx\,dx'} = 2 \frac{l^2 - 2(x-x')^2}{l^4} \sigma\exp(-\frac{(x-x')^2}{l^2})$

Now, let us assume we have data points $\{x_i, y_i ;\ i = 1,\dots,n \}$ and a derivative observation at $x_1$, which I'll call $m_1$.

Let $Y = [m_1, y_1, \dots, y_n]$; then we use a single standard GP with covariance matrix

$K = \left( \begin{array}{cccc} k_{dx_1,dx_1} & k_{dx_1,x_1} & \dots & k_{dx_1,x_n} \\ k_{dx_1,x_1} & k_{x_1,x_1} & \dots & k_{x_1,x_n} \\ \vdots & \vdots & \ddots & \vdots \\ k_{dx_1,x_n} & k_{x_1,x_n} & \dots & k_{x_n,x_n} \end{array} \right)$

The rest of the GP is the same as usual.
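
As a hedged illustration (not gpml code; hyperparameters, inputs, and variable names are assumed), the matrix above could be assembled like this for the Gaussian kernel:

    % Joint covariance for one derivative observation m1 at x1 plus function
    % observations y1..yn, using the kernel and derivatives given above.
    sigma = 1; l = 0.5;                                        % assumed hyperparameters
    k   = @(a, b) sigma * exp(-(a - b).^2 / l^2);              % k(x, x')
    kd  = @(a, b) -2 * (a - b) / l^2 .* k(a, b);               % dk/dx
    kdd = @(a, b) 2 * (l^2 - 2 * (a - b).^2) / l^4 .* k(a, b); % d^2k/dx dx'

    x = [0.1; 0.4; 0.7];                 % training inputs; the derivative is at x(1)
    n = numel(x);

    K = zeros(n + 1);
    K(1, 1)         = kdd(x(1), x(1));   % cov(derivative, derivative)
    K(1, 2:end)     = kd(x(1), x)';      % cov(derivative, function values)
    K(2:end, 1)     = K(1, 2:end)';
    K(2:end, 2:end) = k(x, x');          % ordinary kernel block

    % With Y = [m1; y1; ...; yn], inference proceeds exactly as in a standard GP.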
