Bayesian Optimization: Practical Implementation Details

bayesian-optimization, gaussian-process, optimization

I'm trying out Bayesian optimization, following Snoek, Larochelle, and Adams [http://arxiv.org/pdf/1206.2944.pdf], using GPML [http://www.gaussianprocess.org/gpml/code/matlab/doc/]. I've implemented the Expected Improvement (EI) acquisition function described on page 3 of the paper, and I assume that, to decide where to query my objective next, I should take the $\mathbf{x}$ that maximizes:

$$a_{\mathrm{EI}}(\mathbf{x}; \{\mathbf{x}_n, y_n\}, \theta)$$

But I can't find guidance on which set of candidate $\mathbf{x}$'s to consider. In principle I'd like to find the best $\mathbf{x}$ over the entire domain, and the paper is written in a way that suggests this is possible ("[EI] also has closed form under the Gaussian process"). As a practical matter, though, I need to compute the posterior predictive mean and variance at any $\mathbf{x}^*$ I might consider before I can evaluate $a_{\mathrm{EI}}(\mathbf{x}^*)$, and while these posteriors have closed forms, I still need to compute them with matrix algebra, so I can't see a way around picking a finite set of $\mathbf{x}$'s.
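For reference, the closed form on page 3 (for minimization, with $\gamma(\mathbf{x}) = (y_{\text{best}} - \mu(\mathbf{x}))/\sigma(\mathbf{x})$) is $a_{\mathrm{EI}}(\mathbf{x}) = \sigma(\mathbf{x})\left(\gamma(\mathbf{x})\Phi(\gamma(\mathbf{x})) + \phi(\gamma(\mathbf{x}))\right)$, where $\Phi$ and $\phi$ are the standard normal CDF and PDF. Here is a minimal sketch of how I evaluate it, assuming the Statistics Toolbox's normcdf/normpdf and that mu and sigma are the predictive mean and standard deviation (e.g., from GPML's gp, taking the square root of the returned variance):

```matlab
% Minimal sketch: closed-form EI for minimization. mu and sigma are the GP
% posterior predictive mean and standard deviation at the candidate points
% (column vectors); ybest is the smallest observed target value.
function ei = expected_improvement(mu, sigma, ybest)
    sigma = max(sigma, 1e-12);      % guard against zero predictive variance
    gam   = (ybest - mu) ./ sigma;  % standardized improvement
    ei    = sigma .* (gam .* normcdf(gam) + normpdf(gam));
end
```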

The question: what is a practical method for choosing the (large? medium? small?) set of candidate $\mathbf{x}$'s over which to maximize EI (or any other acquisition function)? (Is this in the paper somewhere and I just missed it?)

At the moment, I'm taking my current set $\{\mathbf{x}_i\}$, sampling it with replacement 2000 times, and then adding some Gaussian noise to each point. It seems okay, I guess.
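In code, the heuristic looks roughly like this (X is my n-by-d matrix of evaluated inputs; the noise scale 0.05 is an arbitrary choice I'd have to tune):

```matlab
% Sketch of the resampling heuristic: jitter the observed inputs to form
% 2000 candidate points, then evaluate EI at each and take the maximizer.
idx        = randi(size(X, 1), 2000, 1);                 % rows, with replacement
candidates = X(idx, :) + 0.05 * randn(2000, size(X, 2)); % add Gaussian noise
```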

Best Answer

The norm is to use a global optimizer of your choice. The problem is that the EI surface is highly multi-modal and disconnected; optimizing this acquisition function is a nontrivial problem in itself.

A common choice that I have seen in various papers is the DIRECT algorithm; sometimes I've seen CMA-ES, which is a state-of-the-art method in nonlinear optimization. In my experience with other forms of optimization, MCS (Multilevel Coordinate Search) tends to work relatively well. You can find a review of derivative-free global optimizers here:

  • Rios and Sahinidis, "Derivative-free optimization: a review of algorithms and comparison of software implementations", Journal of Global Optimization (2013).

By the way, EI is analytical, so if you want you can also compute its gradient to guide the optimization, but this is not necessary. An effective technique is to run a global optimizer first to find promising solutions and then run a local optimizer to refine them (e.g., a quasi-Newton method such as BFGS, that is fminunc in MATLAB, or fmincon if you have constraints).
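As a sketch of that global-then-local pattern, here is a simple random multistart standing in for a dedicated global optimizer such as DIRECT or CMA-ES. It assumes box bounds lb/ub (1-by-d row vectors) and a handle neg_ei(x) returning the negative EI at a row vector x (e.g., wrapping GPML's prediction plus the EI formula above); these names are illustrative, not from any toolbox:

```matlab
% Multistart refinement of the acquisition surface: launch fmincon from
% several random points in the box and keep the best local optimum found.
nstarts = 20;
d       = numel(lb);
opts    = optimoptions('fmincon', 'Display', 'off');
xbest   = []; fbest = Inf;
for k = 1:nstarts
    x0 = lb + rand(1, d) .* (ub - lb);   % random starting point in the box
    [xk, fk] = fmincon(neg_ei, x0, [], [], [], [], lb, ub, [], opts);
    if fk < fbest
        xbest = xk; fbest = fk;          % minimizing neg-EI = maximizing EI
    end
end
```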

Finally, if the speed of optimizing the acquisition function is a factor (which is not the "traditional" BO scenario), I have found decent results by starting with a Latin hypercube design or a quasi-random Sobol sequence design, then refining from the best point(s) with a few steps of a local optimizer; see also @user777's comment. Since this is not the standard BO scenario, I don't have any specific reference that actually uses this method.
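A sketch of that fast variant, under the same assumptions as above (lb, ub, neg_ei) plus the Statistics Toolbox's sobolset (or lhsdesign for a Latin hypercube); requires implicit expansion (MATLAB R2016b or later):

```matlab
% Score a space-filling design on EI; the best points then seed the
% fmincon refinement loop shown above.
d      = numel(lb);
U      = net(sobolset(d, 'Skip', 1e3), 4000);  % 4000 quasi-random points in [0,1]^d
% U    = lhsdesign(4000, d);                   % Latin hypercube alternative
cands  = lb + U .* (ub - lb);                  % rescale to the search box
scores = arrayfun(@(i) neg_ei(cands(i, :)), (1:size(cands, 1))');
[~, order] = sort(scores);                     % ascending: highest EI first
starts = cands(order(1:5), :);                 % seeds for local refinement
```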


Examples of papers that refer to DIRECT or CMA-ES:

  • Calandra, R., Seyfarth, A., Peters, J., & Deisenroth, M. P. (2015). Bayesian optimization for learning gaits under uncertainty. Annals of Mathematics and Artificial Intelligence, 1-19.
  • Mahendran, N., Wang, Z., Hamze, F., & de Freitas, N. (2012). Adaptive MCMC with Bayesian optimization. In International Conference on Artificial Intelligence and Statistics (pp. 751-760).
  • Gunter, T., Osborne, M. A., Garnett, R., Hennig, P., & Roberts, S. J. (2014). Sampling for inference in probabilistic models with fast Bayesian quadrature. In Advances in Neural Information Processing Systems (pp. 2789-2797).

You can just google "Bayesian optimization" plus the desired global optimization algorithm, and you'll find plenty of papers. Moreover, in pretty much every other paper about BO you will find a sentence such as:

[...] BO usually requires an auxiliary global optimizer in each iteration to optimize the acquisition function. It is customary in the BO literature to use DIvided RECTangles (DIRECT) to accomplish such a task. Other global optimization algorithms like CMA-ES could also be applied.