Solved – Most suitable optimizer for the Gaussian process likelihood function

gaussian-process, optimization

The Gaussian process (GP) log likelihood function can be expressed as

$$\log p(\mathbf{y} \mid X, \theta) = -\tfrac{1}{2}\,\mathbf{y}^{\top} K^{-1} \mathbf{y} - \tfrac{1}{2}\log|K| - \tfrac{n}{2}\log 2\pi,$$

where $K$ is a positive definite covariance matrix. The hyperparameters can be obtained by maximizing this likelihood. It is well known that for this nonlinear unconstrained optimization problem, local maxima can still provide good estimates for the Gaussian process, i.e. the method is robust even if you cannot find the global maximizer.
My question is: what is the best method to maximize the GP likelihood? From my experience it seems that gradient-based methods work best, but I would welcome any recommendations on optimization techniques well suited to the GP, especially techniques available in R.

Best Answer

I think this is an open-ended question because a lot will depend on the actual dataset you are optimising against, how close your first candidate solution $s_0$ is to a local optimum, and whether you are interested in (and able to) use derivative information.

I have used R's standard optim function and generally I have found that the L-BFGS-B algorithm is the fastest, or close to the fastest, among the default optimisation algorithms available, provided that I supply a derivative function. In the GPML Matlab code the authors also provide an L-BFGS-B implementation, so I suspect they too found that L-BFGS-B is reasonably competitive when derivative information is supplied within the context of a general application.
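As a minimal sketch of this approach (assuming a squared-exponential kernel and hypothetical hyperparameters: length-scale, signal and noise standard deviations, optimised on the log scale to keep them positive), here is the negative log marginal likelihood and its analytic gradient passed to optim with method = "L-BFGS-B". The gradient uses the standard identity $\partial \log p / \partial \theta_j = \tfrac{1}{2}\operatorname{tr}\big((\alpha\alpha^{\top} - K^{-1})\,\partial K/\partial\theta_j\big)$ with $\alpha = K^{-1}\mathbf{y}$.

```r
## Sketch: GP negative log marginal likelihood with a squared-exponential
## kernel; hyperparameters par = (log length-scale, log signal sd, log noise sd).
gp_neg_loglik <- function(par, X, y) {
  ell <- exp(par[1]); sf <- exp(par[2]); sn <- exp(par[3])
  D2  <- as.matrix(dist(X))^2
  K   <- sf^2 * exp(-0.5 * D2 / ell^2) + diag(sn^2, length(y))
  L   <- chol(K)                               # upper-triangular factor, K = L'L
  alpha <- backsolve(L, forwardsolve(t(L), y)) # alpha = K^{-1} y
  0.5 * sum(y * alpha) + sum(log(diag(L))) + 0.5 * length(y) * log(2 * pi)
}

gp_neg_loglik_grad <- function(par, X, y) {
  ell <- exp(par[1]); sf <- exp(par[2]); sn <- exp(par[3])
  n   <- length(y)
  D2  <- as.matrix(dist(X))^2
  Ks  <- sf^2 * exp(-0.5 * D2 / ell^2)         # noise-free covariance
  K   <- Ks + diag(sn^2, n)
  Ki  <- chol2inv(chol(K))
  alpha <- Ki %*% y
  W   <- alpha %*% t(alpha) - Ki               # alpha alpha' - K^{-1}
  ## dK / d(log theta_j) via the chain rule
  dK_ell <- Ks * D2 / ell^2
  dK_sf  <- 2 * Ks
  dK_sn  <- diag(2 * sn^2, n)
  ## gradient of the NEGATIVE log-likelihood
  -0.5 * c(sum(W * dK_ell), sum(W * dK_sf), sum(W * dK_sn))
}

## Toy data and the call to optim
set.seed(1)
X <- matrix(seq(0, 5, length.out = 40))
y <- sin(X[, 1]) + rnorm(40, sd = 0.2)
fit <- optim(par = log(c(1, 1, 0.1)), fn = gp_neg_loglik, gr = gp_neg_loglik_grad,
             X = X, y = y, method = "L-BFGS-B")
exp(fit$par)   # length-scale, signal sd, noise sd on the original scale
```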

Another option is to use derivative-free optimisation. The Rios and Sahinidis (2013) review paper "Derivative-free optimization: a review of algorithms and comparison of software implementations" in the Journal of Global Optimization seems to be your best bet for something exhaustive. Within R, the minqa package provides derivative-free optimisation by quadratic approximation (QA) routines; it contains some of Powell's most famous "optimisation children": UOBYQA, NEWUOA and BOBYQA. I have found UOBYQA to be the fastest of the three on toy problems, despite Wikipedia's general advice: "For general usage, NEWUOA is recommended to replace UOBYQA." This is not very surprising: log-likelihoods are smooth functions with well-defined derivatives, so NEWUOA might not enjoy an obvious advantage. Again, this shows that there is no silver bullet.

On that matter, I have also played around with the Particle Swarm Optimisation (PSO) and Covariance Matrix Adaptation Evolution Strategy (CMA-ES) algorithms included in the R packages hydroPSO and cmaes respectively. In general, while faster and far more informative than the canned Simulated Annealing (SANN) in optim, they were not remotely competitive in terms of speed with the QA routines. Notice that estimating the hyper-parameter vector $\theta$ of a log-likelihood function is usually a smooth and (at least locally) convex problem, so stochastic optimisation will generally not offer a great advantage.
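If the gradient is awkward to derive, the same hypothetical gp_neg_loglik sketched above can be handed straight to the minqa routines. A rough example, assuming the minqa package is installed and that extra arguments are forwarded to the objective via `...`:

```r
## Derivative-free fit of the same (hypothetical) negative log-likelihood
## using Powell's quadratic-approximation routines from the minqa package.
library(minqa)

start <- log(c(1, 1, 0.1))   # log length-scale, log signal sd, log noise sd

fit_uobyqa <- uobyqa(par = start, fn = gp_neg_loglik, X = X, y = y)
fit_newuoa <- newuoa(par = start, fn = gp_neg_loglik, X = X, y = y)

exp(fit_uobyqa$par)   # hyperparameters back on the original scale
fit_uobyqa$fval       # attained negative log-likelihood
```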

To recap: I would suggest using L-BFGS-B with derivative information. If derivative information is hard to obtain (e.g. due to complicated kernel functions), use the quadratic approximation routines.