Solved – Hyperparameter estimation in Gaussian process

gaussian-process, hyperparameter, optimization

I am trying to optimize the hyperparameters of a Gaussian process. As a starting point I chose the squared exponential covariance function, for which I have to optimize three parameters: $\sigma_f$, $\sigma_n$, and the length-scale $l$.

$$k_y(x_p,x_q) = \sigma^2_f \exp\left(-\frac{1}{2l^2}(x_p-x_q)^2\right) + \sigma^2_n\delta_{pq}$$

As described in [1], I maximize the marginal likelihood with respect to these parameters. To do this I use the Commons Math optimizers and compute the gradients of the marginal likelihood function.
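For reference, a minimal NumPy sketch of the objective being maximized (the log marginal likelihood, eq. (5.8) in [1]) for the covariance above. This is only an illustration of the computation, not the asker's Commons Math code; the function and variable names are my own:

```python
import numpy as np

def se_kernel(x, sigma_f, length, sigma_n):
    """Squared-exponential covariance with noise term (the k_y above)."""
    d2 = (x[:, None] - x[None, :]) ** 2
    return sigma_f**2 * np.exp(-0.5 * d2 / length**2) + sigma_n**2 * np.eye(len(x))

def neg_log_marginal_likelihood(x, y, sigma_f, length, sigma_n):
    """Negative log marginal likelihood (to be minimized)."""
    K = se_kernel(x, sigma_f, length, sigma_n)
    L = np.linalg.cholesky(K)  # raises LinAlgError if K is not positive-definite
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    # log|K| = 2 * sum(log(diag(L))) from the Cholesky factor
    return (0.5 * y @ alpha
            + np.sum(np.log(np.diag(L)))
            + 0.5 * len(x) * np.log(2 * np.pi))
```

Note that the Cholesky factorization is exactly the step that fails once the optimizer drives $K$ away from positive-definiteness.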

Everything works fine in a small example as long as I only optimize $\sigma_f$ and $l$. As soon as I try to optimize $\sigma_n$ as well, I run into numerical problems: after a few iterations of the optimizer the covariance matrix of the process is no longer positive-definite, which produces INFINITY or NAN values inside the optimizer.
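To make the role of $\sigma_n$ concrete, here is a small NumPy illustration (the values are hypothetical, not from my actual data): the $\sigma^2_n\delta_{pq}$ term is the only thing bounding the smallest eigenvalue of $K$ away from zero, so when $\sigma_n$ is driven toward zero while $l$ grows, $K$ becomes numerically rank-deficient.

```python
import numpy as np

x = np.linspace(0, 1, 50)
d2 = (x[:, None] - x[None, :]) ** 2

def min_eig(sigma_f, length, sigma_n):
    K = sigma_f**2 * np.exp(-0.5 * d2 / length**2) + sigma_n**2 * np.eye(len(x))
    return np.linalg.eigvalsh(K).min()

# With a reasonable noise level the ridge keeps K comfortably p.d.
print(min_eig(1.0, 1.0, 0.1))

# With sigma_n near zero and a huge length-scale, K approaches a rank-one
# matrix; its smallest eigenvalue is within rounding error of zero and can
# even come out negative, so a Cholesky factorization may fail.
print(min_eig(1.0, 100.0, 1e-9))
```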

Can anyone explain that behavior and the connection to the hyperparameter $ \sigma_n$?

[1] Gaussian Processes for Machine Learning
Carl Edward Rasmussen and Christopher K. I. Williams
The MIT Press, 2006. ISBN 0-262-18253-X.

Best Answer

It would be a good idea to get the optimisation code to print out the hyper-parameters each time it performs a function evaluation. Usually you can work out what is going wrong once you know what the model decides about the hyper-parameter values.

I suspect what is happening is that the model has decided that an essentially linear classifier would be best and has set $\sigma_n$ to zero, as the noise component is not seen as necessary (which is a shame, as that term adds a ridge to the covariance matrix, which helps ensure that it is positive-definite). In that case only the SEiso part is used, so $\sigma_f$ will probably be much larger than $\sigma_n$; however, to make a linear classifier it will try to make $l$ as large as possible, which seems to end up causing numerical problems when evaluating the term inside the exponential. I'm a pretty heavy user of GPML and have seen this a fair bit.

One solution is to limit the magnitudes of the logarithms of the hyper-parameters during the search (which is equivalent to having a hyper-prior on the hyper-parameters); this tends to prevent the problem. If you print out the values of the hyper-parameters, the last ones that get printed before it goes "bang" will give you a good idea of where to place the limits. This tends not to affect performance very much: the generalisation error at such points in hyper-parameter space tends to be fairly flat, which causes gradient descent methods to take large steps that put you far enough from the origin that you get numerical accuracy problems.
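A minimal sketch of the two suggestions combined, clamping the log hyper-parameters and tracing them at every step. This is illustrative only; the function names, bounds, and step size are my own choices, and the gradient would come from your marginal-likelihood code:

```python
import numpy as np

def clamp(log_theta, lo=-6.0, hi=6.0):
    """Box-constrain the log hyper-parameters.

    Equivalent to a flat hyper-prior on a bounded box in
    log-hyper-parameter space; the bounds here are arbitrary examples.
    """
    return np.clip(log_theta, lo, hi)

def traced_step(log_theta, grad, step=0.1):
    """One gradient-descent step on the log hyper-parameters, with a trace.

    The last values printed before the optimizer blows up tell you
    roughly where to place the limits.
    """
    new = clamp(log_theta - step * grad)
    print("log hypers:", new)
    return new
```

Working in log space has the side benefit that $\sigma_f$, $\sigma_n$, and $l$ can never actually reach zero, only their lower bound.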

In short: whenever you run into numerical issues during model selection, print out the hyper-parameter values at each step.