Understanding Bayesian optimization, Gaussian processes, and acquisition functions

bayesian, bayesian-optimization, gaussian-process

I'm trying to understand Bayesian optimization, and I'm struggling with all the methods involved. Hence, I have some short questions:

  • We start with an a priori function, which is a Gaussian process.
  • A Gaussian process is something like a normal distribution, but over functions instead of variables.
  • A Gaussian process is mainly described by the expectation function $ m(x) $ together with a covariance function $ k(x, x') $.
  • The Gaussian process is used to optimize an acquisition function.
  • The maximum of the acquisition function determines where to evaluate the next sample.
  • We use the next sample together with the a priori/Gaussian process (which together form the so-called a posteriori function) to get a "new" a priori function.

Is this correct? Did I miss something?

Here is just a random plot, to make it easier to respond and/or to refer to things in the answers:

Image 1: [plot omitted]

Image 2: [plot omitted]

Best Answer

In addition to Tim's answer, here are some slight nitpicks/clarifications which might assist your intuitive understanding and prevent possible confusion in the future:

We start with an a priori function, which is a Gaussian process.

The Gaussian process (GP) here is used as a prior over the functions we consider as candidates for modelling the objective function, i.e., we use the GP as a surrogate for the function we want to optimise. This is usually because that function is "expensive" (in some sense) to query, but also because the GP provides a nice quantification of predictive uncertainty, which the acquisition function uses to inform the search strategy.
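For concreteness, here is a minimal sketch of that surrogate idea, assuming scikit-learn is available and using a made-up, cheap stand-in for the "expensive" objective: fit a GP to a handful of observations and ask it for a predictive mean and standard deviation (the uncertainty the acquisition function will later exploit).

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel

# Hypothetical "expensive" objective we want to optimise (stand-in only).
def expensive_objective(x):
    return -np.sin(3 * x) - x**2 + 0.7 * x

# A handful of observations of the expensive function.
X_obs = np.array([[-0.9], [0.1], [1.1]])
y_obs = expensive_objective(X_obs).ravel()

# GP surrogate; the kernel choice (constant * RBF) is just an assumption here.
gp = GaussianProcessRegressor(kernel=ConstantKernel(1.0) * RBF(length_scale=1.0),
                              normalize_y=True)
gp.fit(X_obs, y_obs)

# Predictive mean and standard deviation over a grid: the std is the
# uncertainty that the acquisition function will later use.
X_grid = np.linspace(-2, 2, 200).reshape(-1, 1)
mu, sigma = gp.predict(X_grid, return_std=True)
```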

A Gaussian process is something like a normal distribution, but over functions instead of variables.

Yes, kind of. Another way of thinking about it is that it's a multivariate normal distribution defined "lazily" over a continuous space: you can pick any two points in the space and the Gaussian process will "tell you how similar the values at those points are" by way of a mean vector and covariance matrix. The GP is lazy in the sense that it provides a way of calculating the mean vector and covariance matrix for any given combination of points when you want them. By choosing which points have similar values (and by how much), a practitioner can impose "structure" on the space they're modelling.
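To make that "laziness" concrete, here is a small NumPy-only sketch (the zero mean and squared-exponential kernel are arbitrary illustrative choices): pick any finite set of points, materialise the mean vector and covariance matrix at exactly those points, and draw function values from the resulting multivariate normal.

```python
import numpy as np

def mean_fn(x):
    # Zero mean function, a common default choice.
    return np.zeros_like(x)

def kernel(x, x_prime, length_scale=0.5):
    # Squared-exponential covariance k(x, x'): nearby points get high
    # covariance, i.e. their function values are "similar".
    return np.exp(-0.5 * (x[:, None] - x_prime[None, :])**2 / length_scale**2)

# Pick any points we like -- the GP only materialises a finite-dimensional
# Gaussian at exactly these points.
x = np.linspace(-3, 3, 100)
mu = mean_fn(x)      # mean vector, entries m(x_i)
K = kernel(x, x)     # covariance matrix, entries k(x_i, x_j)

# Each sample is one plausible function, evaluated at the chosen points.
samples = np.random.multivariate_normal(mu, K + 1e-8 * np.eye(len(x)), size=3)
```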

A Gaussian process is mainly described by the expectation function $ m(x) $ together with a covariance function $ k(x, x') $.

A GP is completely defined ("parameterised") by the mean function and covariance function (the "kernel").
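In standard notation (nothing specific to this question): writing $ f \sim \mathcal{GP}(m, k) $ means that for any finite collection of inputs $ x_1, \dots, x_n $, the corresponding function values are jointly Gaussian,

$$
\big(f(x_1), \dots, f(x_n)\big) \sim \mathcal{N}(\boldsymbol{\mu}, K),
\qquad \mu_i = m(x_i), \quad K_{ij} = k(x_i, x_j).
$$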

The Gaussian process is used to optimize an acquisition function.

Not quite. The acquisition function is a function of the GP's predictive distribution: to evaluate the acquisition function at a point $x$, you first get the (posterior) predictive distribution of the GP at that point, and then that predictive distribution (the marginal distribution of the GP at $x$, if you like) goes into the acquisition function, which tells you how "useful" (in some sense) the point $x$ is. The acquisition function itself is usually (but not always) optimised using some gradient-based method, because analytical derivatives of most acquisition functions are quite easy to calculate.
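As a concrete example, here is a sketch of one popular acquisition function, expected improvement: it consumes only the GP's predictive mean and standard deviation at the candidate points. I'm assuming here that we are minimising the objective, that `f_best` is the best value observed so far, and that the small exploration offset `xi` is just a commonly used default.

```python
import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sigma, f_best, xi=0.01):
    """Expected improvement at points with GP predictive mean `mu` and standard
    deviation `sigma`, assuming we are MINIMISING the objective; `xi` is a small
    exploration offset (an assumed, commonly used default)."""
    mu = np.atleast_1d(mu).astype(float)
    sigma = np.atleast_1d(sigma).astype(float)
    improvement = f_best - mu - xi                      # gain over the incumbent
    z = np.divide(improvement, sigma,
                  out=np.zeros_like(sigma), where=sigma > 0)
    ei = improvement * norm.cdf(z) + sigma * norm.pdf(z)
    ei[sigma == 0.0] = 0.0                              # no uncertainty, no expected gain
    return ei
```

With the surrogate sketch from above, something like `expected_improvement(mu, sigma, y_obs.min())` would score every point on the grid.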

The maximum of the acquisition function determines where to evaluate the next sample.

Yes. Following from the above, we pick "the most useful point" as determined by the acquisition function.

We use the next sample together with the a priori/Gaussian process (which together form the so-called a posteriori function) to get a "new" a priori function.

What happens here is that you query your original "expensive" objective function (the function you're optimising using Bayesian optimisation) at the chosen point, add that observation to the "data" you have, condition/refit the GP on the enlarged data set (that conditioned GP is the posterior, which plays the role of the prior for the next round), and repeat the process.
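Putting the whole loop together, here is a minimal end-to-end sketch under the same assumptions as above (toy one-dimensional objective, scikit-learn GP, expected improvement in its minimisation form, and the acquisition maximised over a fixed candidate grid purely for simplicity; a real implementation would typically use a multi-start optimiser instead of a grid):

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

def expensive_objective(x):
    # Stand-in for the real expensive function being minimised.
    return np.sin(3 * x) + x**2 - 0.7 * x

rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(3, 1))                  # initial design
y = expensive_objective(X).ravel()
X_cand = np.linspace(-2, 2, 500).reshape(-1, 1)      # candidate grid for the acquisition

for _ in range(10):
    # 1. Fit/refit the GP surrogate to all data gathered so far.
    gp = GaussianProcessRegressor(kernel=RBF(length_scale=0.5), normalize_y=True)
    gp.fit(X, y)

    # 2. Evaluate the acquisition (expected improvement, minimisation form)
    #    from the GP's predictive distribution on the candidate grid.
    mu, sigma = gp.predict(X_cand, return_std=True)
    imp = y.min() - mu
    z = np.divide(imp, sigma, out=np.zeros_like(sigma), where=sigma > 0)
    ei = imp * norm.cdf(z) + sigma * norm.pdf(z)

    # 3. Query the expensive objective at the acquisition maximiser, add the
    #    new observation to the data, and repeat.
    x_next = X_cand[np.argmax(ei)].reshape(1, -1)
    X = np.vstack([X, x_next])
    y = np.append(y, expensive_objective(x_next).ravel())
```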
