Solved – Why does Bayesian optimization work

bayesian-optimization, gaussian-process, neural-networks

Bayesian optimization is used to optimize costly black-box functions. The idea is to fit a surrogate model to the black-box function and then use an acquisition function to choose the next point to evaluate. The goal is to get very close to the optimum with very few evaluations of the black-box function.

Generally, a Gaussian process or a random forest is used as the surrogate model. Since we have very few values/data points of the black-box function, why don't the surrogate models in Bayesian optimization overfit?

Would it make sense to use a deep neural network as the surrogate model? I think not, because a deep neural network would probably overfit with so few data points.

Best Answer

Gaussian Process models have two nice properties in the Bayesian Optimization context:

  1. They can exactly fit the observed values of the black-box function $f$ (at least when the observations are noise-free).
  2. They smoothly interpolate between the observed data points, with statistical uncertainty that increases the farther you move away from the observed data. This transition is predictable and well understood.

These are essentially the properties you'd want in a Bayesian optimization context: you know the function values at the points you have already evaluated (property 1), but you're less certain about what happens between or far away from those evaluations (property 2).
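As a concrete illustration (not part of the original question), here is a minimal sketch using scikit-learn's `GaussianProcessRegressor` on a made-up 1-D toy function; the toy function, training points, and fixed length-scale are all illustrative assumptions:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

def f(x):
    # stand-in for the expensive black-box function
    return np.sin(3 * x) + x

X_train = np.array([[0.0], [0.5], [1.0], [2.0]])   # a handful of evaluations
y_train = f(X_train).ravel()

# Hyper-parameters held fixed (optimizer=None) purely for illustration.
gp = GaussianProcessRegressor(kernel=RBF(length_scale=0.5), optimizer=None)
gp.fit(X_train, y_train)

# Property 1: at an observed point the posterior mean matches f and the std is ~0.
mu_obs, std_obs = gp.predict(np.array([[0.5]]), return_std=True)
print(mu_obs[0], f(0.5), std_obs[0])       # mean ~ f(0.5), std ~ 0

# Property 2: between and far away from observations the uncertainty grows.
mu_mid, std_mid = gp.predict(np.array([[1.5]]), return_std=True)
mu_far, std_far = gp.predict(np.array([[4.0]]), return_std=True)
print(std_obs[0], std_mid[0], std_far[0])  # increasing standard deviation
```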

(Deep) neural nets don't necessarily have either property. Regarding property 1, it can be hard to get an NN to model a specific function without a lot of tedious work: trying different model configurations and tuning hyper-parameters.

Aside: Often the original problem you're working on is itself a hyper-parameter tuning problem, such as selecting the optimal parameters for an SVM or perhaps even another NN. It would be more than a little circular to solve it by tuning yet another neural network, since that new problem could be just as challenging as the original. I'd prefer hyper-parameter tuning to be a fully automated procedure; fitting a bespoke neural network as the surrogate just replaces the original problem (optimizing a challenging, possibly expensive function) with another problem of similar complexity (designing a good neural network).

Regarding property 2, there is no guarantee that an NN interpolates reasonably between the observed data or extrapolates reasonably beyond them. An NN that fits the data exactly can still have too many degrees of freedom, so its interpolations may be erratic. By contrast, when a Gaussian process with a constant mean function (a standard GP model in BO) makes a prediction sufficiently far away from the observed data, the prediction reverts to the constant mean and the variance reverts to the prior variance. An NN, on the other hand, can increase or decrease beyond the extrema of the training data, oscillate in a complex pattern, or do any number of other things. Even if the final NN activation is bounded, these fluctuations can still flip the prediction from one extreme to the other between or beyond the training data; this doesn't seem reasonable. The Gaussian process behavior seems more plausible.
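To make the reversion-to-prior behavior concrete, here is a minimal numpy sketch of the standard GP posterior, under the assumption of a noise-free RBF kernel and a constant mean taken to be zero for simplicity; the data are toy values I made up. Once the covariances $k(x_*, X)$ are essentially zero, the posterior mean collapses to the constant mean and the posterior variance to the prior variance $k(x_*, x_*)$:

```python
import numpy as np

def rbf(a, b, gamma=1.0):
    # k(x, y) = exp(-gamma * ||x - y||^2)
    return np.exp(-gamma * (a[:, None] - b[None, :]) ** 2)

X = np.array([0.0, 0.5, 1.0, 2.0])      # observed inputs
y = np.sin(3 * X) + X                   # observed values of a toy black-box

K = rbf(X, X) + 1e-10 * np.eye(len(X))  # jitter for numerical stability
K_inv_y = np.linalg.solve(K, y)

def posterior(x_star):
    k_star = rbf(np.atleast_1d(x_star), X)             # shape (1, n)
    mean = k_star @ K_inv_y                             # + 0 (constant zero mean)
    var = 1.0 - k_star @ np.linalg.solve(K, k_star.T)   # prior variance k(x,x) = 1
    return mean.item(), var.item()

print(posterior(0.5))    # mean ~ observed value, variance ~ 0
print(posterior(10.0))   # k_star ~ 0, so mean ~ 0 (prior mean), variance ~ 1 (prior)
```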

It's worth noting that GPs aren't a perfect solution, either. The Gaussian process's behavior is governed, in a very specific and obvious way, by the parameters of the kernel. For example, the standard RBF kernel $$ k(x, y) = \exp(-\gamma ||x - y||_2^2) $$ is characterized by the parameter $\gamma > 0$, which plays the role of an inverse (squared) length-scale: large $\gamma$ corresponds to a short length-scale and vice versa. The choice of $\gamma$ strongly influences the predictions that you make: set it too large and the surrogate swings sharply and reverts to the prior between the observed data; set it too small and the fit between observations is nearly flat. A core challenge of building a good surrogate model is selecting the GP hyper-parameters well.
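As a rough illustration of this sensitivity, the following sketch fits the same three toy points with three different length-scales (scikit-learn, hyper-parameters held fixed rather than optimized by maximum likelihood; the specific numbers are arbitrary). Note that scikit-learn's `RBF` is parameterized by a length-scale $\ell$, with $\gamma = 1 / (2\ell^2)$ relative to the formula above:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

X = np.array([[0.0], [1.0], [2.0]])
y = np.array([0.0, 1.0, 0.0])
x_mid = np.array([[0.5]])                 # a query point between observations

for length_scale in (0.05, 0.5, 10.0):    # large gamma ... small gamma
    gp = GaussianProcessRegressor(kernel=RBF(length_scale), optimizer=None)
    gp.fit(X, y)
    mu, std = gp.predict(x_mid, return_std=True)
    print(f"length_scale={length_scale}: mean={mu[0]:.3f}, std={std[0]:.3f}")

# With length_scale = 0.05 the GP has already reverted to its prior between the
# points (mean ~ 0, std ~ 1); with length_scale = 10 the fit between the points
# is very smooth with a tiny std (overconfident if the true function wiggles).
```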

GPs also give a cheap, closed-form way to query both the posterior mean (the GP's estimate of the function value at a point) and the posterior variance (the GP's uncertainty around that estimate) at any point. This is important because BO acquisition functions typically combine the mean and variance estimates to decide which point to evaluate next with the expensive black-box function.
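For example, Expected Improvement is one common acquisition function built directly from the posterior mean and standard deviation. Here is a minimal sketch (scikit-learn + scipy, maximization convention, toy data; the candidate grid and the $\xi$ value are arbitrary choices, not anything prescribed by the answer above):

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

X = np.array([[0.0], [0.5], [1.0], [2.0]])
y = (np.sin(3 * X) + X).ravel()            # toy black-box evaluations

gp = GaussianProcessRegressor(kernel=RBF(length_scale=0.5), optimizer=None).fit(X, y)

def expected_improvement(x_candidates, gp, y_best, xi=0.01):
    """EI(x) = E[max(f(x) - y_best - xi, 0)] under the GP posterior."""
    mu, std = gp.predict(x_candidates, return_std=True)
    std = np.maximum(std, 1e-12)           # avoid division by zero at observed points
    z = (mu - y_best - xi) / std
    return (mu - y_best - xi) * norm.cdf(z) + std * norm.pdf(z)

candidates = np.linspace(-1.0, 4.0, 501).reshape(-1, 1)
ei = expected_improvement(candidates, gp, y_best=y.max())
x_next = candidates[np.argmax(ei)]         # next point to evaluate expensively
print(x_next)
```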
