I'm giving Bayesian optimization a go, following Snoek, Larochelle, and Adams [http://arxiv.org/pdf/1206.2944.pdf], using GPML [http://www.gaussianprocess.org/gpml/code/matlab/doc/]. I've implemented the Expected Improvement (EI) acquisition function described on page 3, and I assume I'm right that, to decide where to query my objective next, I should take the $\mathbf{x}$ that maximizes:
$a_{\mathrm{EI}}(\mathbf{x}; \{\mathbf{x}_n, y_n\}, \theta)$
But I can't seem to find guidance on which set of candidate $\mathbf{x}$'s to consider. In theory I'd like to find the best $\mathbf{x}$ over the entire domain, and the paper's wording suggests this is possible ("[EI] also has closed form under the Gaussian process"). As a practical matter, though, before I can compute $a_{\mathrm{EI}}(\mathbf{x}^*)$ I need the posterior predictive mean and variance at every $\mathbf{x}^*$ I might consider, and while these posteriors have a closed form, I still have to compute them with matrix algebra, so I can't see a way around picking a finite set of $\mathbf{x}$'s.
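For concreteness, here is a minimal sketch (mine, not from the paper) of the closed-form EI for minimization, assuming you have already computed the posterior predictive mean `mu` and standard deviation `sigma` at a candidate point:

```python
import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sigma, f_best):
    """Closed-form EI for minimization, given the GP posterior mean `mu`
    and standard deviation `sigma` at a candidate point, and the best
    objective value observed so far, `f_best`."""
    sigma = np.maximum(sigma, 1e-12)      # guard against zero variance
    z = (f_best - mu) / sigma
    return (f_best - mu) * norm.cdf(z) + sigma * norm.pdf(z)
```

Evaluating this over a candidate set vectorizes easily; the real cost is indeed the matrix algebra that produces `mu` and `sigma`.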
The question: what is a practical method for choosing the (large? medium? small?) set of candidate $\mathbf{x}$'s over which to maximize EI (or any other acquisition function)? (Is this addressed in the paper and I just missed it?)
At the moment, I'm just taking my current set $\{\mathbf{x}_i\}$, sampling from it with replacement 2000 times, and adding some Gaussian noise to each point. It seems okay, I guess.
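That heuristic can be sketched as follows (a hypothetical illustration; the dimensions and noise scale are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
X_observed = rng.uniform(-1.0, 1.0, size=(20, 3))   # stand-in for the current set {x_i}

# Resample the observed points with replacement, then jitter with Gaussian noise.
idx = rng.integers(0, len(X_observed), size=2000)
candidates = X_observed[idx] + rng.normal(scale=0.1, size=(2000, X_observed.shape[1]))
```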
Best Answer
The norm is to use any global optimizer you like. The problem is that the EI surface is highly multi-modal and disconnected; optimizing this acquisition function is a nontrivial problem in itself.
A common choice I have seen in various papers is the DIRECT algorithm; sometimes it's CMA-ES, a state-of-the-art method in nonlinear optimization. In my experience with other forms of optimization, MCS (Multi-Level Coordinate Search) tends to work relatively well. You can find a review of derivative-free global optimizers here:
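As a sketch, recent SciPy versions ship DIRECT as `scipy.optimize.direct`; the objective below is a made-up stand-in for a negated acquisition surface, not the real EI:

```python
import numpy as np
from scipy.optimize import direct

def neg_acq(x):
    # Hypothetical stand-in; in practice this would return -EI(x)
    # computed from the GP posterior at x.
    return float(np.sum((np.asarray(x) - 0.5) ** 2))

# DIRECT needs only box bounds, no gradients or starting point.
res = direct(neg_acq, bounds=[(0.0, 1.0), (0.0, 1.0)])
x_next = res.x    # approximate maximizer of the acquisition
```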
By the way, EI is analytical, so if you want you can also compute its gradient to guide the optimization, but this is not necessary. An effective technique is to run a global optimizer first to find promising solutions and then a local optimizer to refine them (e.g., a quasi-Newton method such as BFGS, which is fminunc in MATLAB; or fmincon if you have constraints).
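A sketch of that two-stage scheme, using random multistart as a (crude) global stage and L-BFGS-B for the local refinement; the acquisition here is a made-up multi-modal stand-in, negated because SciPy minimizes:

```python
import numpy as np
from scipy.optimize import minimize

def neg_acq(x):
    # Hypothetical multi-modal stand-in for a negated EI surface.
    return float(-(np.sin(3.0 * x[0]) * np.exp(-x[0] ** 2)))

rng = np.random.default_rng(0)
starts = rng.uniform(-2.0, 2.0, size=(50, 1))   # coarse global exploration
best_start = min(starts, key=neg_acq)           # most promising start found
res = minimize(neg_acq, best_start, method="L-BFGS-B", bounds=[(-2.0, 2.0)])
x_next = res.x                                  # refined candidate
```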
Finally, if the speed of optimizing the acquisition function is a factor (which is not the "traditional" BO scenario), I have found decent results by starting with a Latin hypercube design or a quasi-random Sobol sequence design, then refining from the best point(s) with a few steps of a local optimizer; see also @user777's comment. Since this is not the standard BO scenario, I don't have a specific reference that actually uses this method.
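For example, with SciPy's quasi-Monte Carlo module (`scipy.stats.qmc`, which also provides `LatinHypercube`), again with a made-up stand-in for the acquisition:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import qmc

def neg_acq(x):
    # Hypothetical stand-in for a negated acquisition surface.
    return float(np.sum((np.asarray(x) - 0.3) ** 2))

sampler = qmc.Sobol(d=2, scramble=True, seed=0)
cand = sampler.random_base2(m=6)      # 2^6 = 64 quasi-random points in [0, 1]^2
best = min(cand, key=neg_acq)         # best point of the initial design
res = minimize(neg_acq, best, method="L-BFGS-B", bounds=[(0.0, 1.0)] * 2)
x_next = res.x
```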
Examples of papers that refer to DIRECT or CMA-ES:
You can just google "Bayesian optimization" plus the desired global optimization algorithm, and you'll find plenty of papers. Moreover, in pretty much every other paper on BO you will find a sentence such as: