Solved – Why does Bayesian Optimization perform poorly in more than 20 Dimensions

bayesian, machine learning, neural networks, normal distribution, optimization

I have been studying Bayesian Optimization lately and made the following notes about this topic:

  • Unlike deterministic functions, real world functions are constructed using physical measurements

  • Measurements are always subject to error (e.g. human error, design and experimental error, random error, measurement error) – if you repeat the same measurements under the same conditions at a future time, the recorded values will very likely differ from the previous ones

  • Thus, an objective function that is based on physical measurements is naturally "unstable" – two people measuring the same system might record different values and, as a result, end up with two different objective functions.

  • A "noisy" function is also an "unstable" function – if we were top optimize this "noisy function", the optimization results might not correspond to natural system we are studying due to inherent errors while recording measurements. This means that in some sense, we are dealing with a more complicated version of "apples and oranges".

  • Bayesian Optimization attempts to solve this problem by using the recorded measurements as "pegs" and fitting a "circus tent" over them in the form of a Gaussian Process. This acts like "probabilistic smoothing" and tries to statistically account for the uncaptured variation in the measurements – provided the assumption that the "data generating process" is well represented by a Gaussian Process is at least approximately true (see the short sketch after these notes).

  • Thus, Bayesian Optimization tries to "smooth out" the noise/variation/error in the objective function, adding a natural "robustness" to the final optimization solution. All this means is that Bayesian Optimization has the potential to give us better results.
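
To make the "probabilistic smoothing" note above concrete, here is a minimal sketch (my own illustration, not taken from any of the papers below) using scikit-learn's GaussianProcessRegressor: a handful of noisy measurements of an assumed test function are fitted with an RBF-plus-white-noise kernel, and the GP's posterior mean plays the role of the smoothed objective, while the posterior standard deviation quantifies the remaining uncertainty. The function, noise level, and kernel settings are all illustrative assumptions.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

# Hypothetical noisy "physical measurements" of an unknown 1-D objective
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(15, 1))             # 15 measured input settings
y = np.sin(X).ravel() + rng.normal(0, 0.2, 15)   # true signal + measurement error

# GP surrogate: the RBF term captures the smooth trend,
# the WhiteKernel term absorbs the measurement noise
kernel = RBF(length_scale=1.0) + WhiteKernel(noise_level=0.05)
gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True).fit(X, y)

# The posterior mean is the "smoothed" objective (the "circus tent" over the
# "pegs"); the posterior std quantifies the uncertainty left after smoothing
X_grid = np.linspace(0, 10, 200).reshape(-1, 1)
mean, std = gp.predict(X_grid, return_std=True)
print(mean[:3], std[:3])
```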

Advantages of Bayesian Optimization:

  • Robustness (as described above)
  • Does not require the objective function to be differentiable (e.g. useful in discrete and combinatorial optimization problems)
  • Since it does not compute derivatives, it has the potential to be more "computationally efficient" than gradient-based optimization methods.

Disadvantages of Bayesian Optimization:

  • Requires the true objective function to be well modelled by a Gaussian Process
  • Empirically has been observed to perform poorly on high dimensional objective functions (i.e. higher than 20 dimensions) – however, I don't understand why.

I have often heard this claim being made about Bayesian Optimization performing poorly in more than 20 dimensions, but I have never been able to understand why this is. I tried to consult some references online:

1) "High-Dimensional Bayesian Optimization with Sparse
Axis-Aligned Subspaces" (Eriksson et al 2021)

  • "High-dimensional BO presents a particular challenge, in part because the curse of dimensionality makes it difficult to define—as well as do inference over—a suitable class of surrogate models."

  • "While BO has become a workhorse algorithm that is employed in a wide variety of settings, successful applications are often limited to low-dimensional problems, e.g. fewer
    than twenty dimensions [Frazier, 2018]. Applying BO to high-dimensional problems remains a significant challenge. The difficulty can be traced to both of the algorithm components mentioned above, although we postulate that suitable function priors are especially important for good performance. In particular, in order for BO to be sample-efficient in high-dimensional spaces, it is crucial to define surrogate models that are sufficiently parsimonious that they can be inferred from a small number of query points. An overly
    flexible class of models is likely to suffer from overfitting, which severely limits its effectiveness in decision-making. Likewise, an overly rigid class of models is unlikely to
    capture enough features of the objective function. A compromise between flexibility and parsimony is essential."

2) "High-dimensional Bayesian optimization using low-dimensional feature spaces" (Moriconi et al, 2020)

  • "However, BO (Bayesian Optimization) is practically limited to optimizing 10–20 parameters. To scale BO to high dimensions,
    we usually make structural assumptions on the decomposition of the objective
    and/or exploit the intrinsic lower dimensionality of the problem, e.g. by using
    linear projections. We could achieve a higher compression rate with nonlinear
    projections, but learning these nonlinear embeddings typically requires much
    data. This contradicts the BO objective of a relatively small evaluation budget."

3) "A Tutorial on Bayesian Optimization" (Frazier, 2018)

  • "It (Bayesian Optimization) is best-suited for optimization over continuous domains of less than 20"

  • "The input x is in R-d for a value of d that is not too large. Typically d ≤ 20 in most successful applications of BayesOpt."

My Question: Nowhere in these papers do they explain why "20 Dimensions" in particular should mark the point at which the performance of Bayesian Optimization begins to deteriorate.

  • Can someone please explain why "20 Dimensions" is said to be the maximum threshold for Bayesian Optimization?

  • The papers above do offer some explanation of why Bayesian Optimization struggles in higher dimensions – but can someone please help me understand this in more detail?

References:
Eriksson et al. (2021), "High-Dimensional Bayesian Optimization with Sparse Axis-Aligned Subspaces" (PDF)
Frazier (2018), "A Tutorial on Bayesian Optimization" (PDF)
Moriconi et al. (2020), "High-dimensional Bayesian optimization using low-dimensional feature spaces" (PDF)

Best Answer

To be completely honest, it's because everything performs poorly in more than 20 dimensions. Bayesian optimization isn't special here.

Trying to optimize any function in a lot of dimensions is hard, because the volume of a high-dimensional space grows exponentially with the number of dimensions. Consider a line segment on $[0, k]$; that has length $k$. A square on $[0, k]^2$? That has area $k^2$. A cube on $[0, k]^3$ has volume $k^3$, and so on. So the amount of space that you have to search when you're looking for a possible solution goes up very, very fast. I recommend looking up the term "Curse of Dimensionality" for more.
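
To put rough numbers on that (a purely illustrative sketch, not part of the original answer): under a naive view of "covering" the search space, the evaluation budget you would need explodes with the dimension, and a fixed budget leaves most of the space far from any sample.

```python
import numpy as np

# How many evaluations would a naive grid need to cover [0, 1]^d with just
# 3 points per axis? The count is 3**d, which explodes with the dimension d.
for d in (1, 2, 5, 10, 20):
    print(f"d = {d:2d}: grid points needed = {3**d:,}")

# Equivalently, with a fixed budget of 1,000 random samples, the distance from
# a random query point to its nearest sample grows quickly with d, so most of
# the space is never "seen" by any surrogate model.
rng = np.random.default_rng(0)
for d in (1, 2, 5, 10, 20):
    samples = rng.random((1000, d))
    query = rng.random(d)
    nearest = np.min(np.linalg.norm(samples - query, axis=1))
    print(f"d = {d:2d}: distance to nearest of 1,000 samples ≈ {nearest:.2f}")
```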

This will always be true, regardless of what algorithm you use -- unless you're willing to make some strong simplifying assumptions about the shape of that function. For example, gradient descent can do quite well in high dimensions -- as long as your function is differentiable. If you have a function where the gradient is 0 somewhere besides the minimum (plateaus, saddle points, local minima), you're screwed.
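
To illustrate the gradient-descent point (a toy sketch under the assumption of a smooth convex objective, which is exactly the kind of strong structure being described): in 100 dimensions, plain gradient descent on a quadratic bowl reaches the minimum in a couple hundred steps, because the gradient tells it exactly where to go.

```python
import numpy as np

# Gradient descent on f(x) = ||x - x_star||^2 in 100 dimensions.
# Differentiability + convexity make the dimension almost irrelevant here:
# the structure, not the low dimension, is what makes the problem easy.
d = 100
rng = np.random.default_rng(0)
x_star = rng.normal(size=d)     # (unknown) minimizer
x = np.zeros(d)                 # starting point
for _ in range(200):
    grad = 2 * (x - x_star)     # exact gradient of f
    x -= 0.1 * grad             # fixed step size
print("distance to optimum:", np.linalg.norm(x - x_star))  # essentially 0
```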

Bayesian optimization is exactly the same. The papers you've linked point out that if your function has exploitable structure, you can build that into the prior. For example, you can assume sparsity (that only a few of those dimensions are important and the rest can be ignored), or differentiability, in which case you can use gradient-enhanced Gaussian processes. But if you don't have that structure, you're screwed.
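
As a small illustration of the sparsity idea (my own sketch, not the actual method from the Eriksson et al. paper): a Gaussian process with one length scale per dimension (an "ARD" RBF kernel) can learn large length scales for dimensions that don't affect the objective and effectively ignore them. The 20-dimensional toy objective below, in which only the first two inputs matter, is an assumed example.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

# Hypothetical 20-D objective in which only the first 2 dimensions matter
rng = np.random.default_rng(1)
d = 20
X = rng.uniform(-1, 1, size=(100, d))
y = np.sin(3 * X[:, 0]) + X[:, 1] ** 2   # dims 2..19 are pure distractors

# ARD kernel: one length scale per dimension. After fitting, the length
# scales of the distractor dimensions tend to grow large, i.e. the GP
# learns to ignore them -- a simple form of the sparsity assumption.
kernel = RBF(length_scale=np.ones(d))
gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True).fit(X, y)

ls = gp.kernel_.length_scale
print("length scales, relevant dims (0, 1):", np.round(ls[:2], 2))
print("length scales, distractor dims (min):", np.round(ls[2:].min(), 2))
```

If the learned length scales for the distractor dimensions come out much larger than those for the relevant ones, the surrogate has effectively recovered the low-dimensional structure -- which is the kind of assumption the papers argue you need in order to make BO sample-efficient in high dimensions.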

As for 20 dimensions, that's a rule of thumb. There's no hard "threshold" or anything -- the problem just gets exponentially harder as the dimension grows.
