Optimization Techniques – Handling Slow-to-Evaluate Cost Functions

Tags: bayesian-optimization, gradient-descent, optimization

Gradient descent and many other methods are useful for finding local minima in cost functions. They can be efficient when the cost function can be evaluated quickly at each point, whether numerically or analytically.

I have what appears to me to be an unusual situation. Each evaluation of my cost function is expensive. I am attempting to find a set of parameters that minimize a 3D surface against ground truth surfaces. Whenever I change a parameter, I need to run the algorithm against the entire sample cohort to measure its effect. To calculate a gradient, I need to perturb each of the 15 parameters independently, meaning I have to regenerate all the surfaces and compare against the sample cohort once per parameter, which is far too many evaluations per gradient, and far too many over the course of the optimization.
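For concreteness, the bottleneck looks roughly like this (a sketch only; `evaluate_cohort` is a hypothetical stand-in for rerunning the algorithm over the whole cohort and scoring against the ground truth):

```python
import numpy as np

def finite_difference_gradient(evaluate_cohort, params, eps=1e-4):
    """One-sided finite-difference gradient.

    `evaluate_cohort` stands in for the expensive step: regenerate the
    surfaces for `params` and compare against the whole sample cohort.
    With 15 parameters this costs 16 full cohort evaluations per gradient.
    """
    params = np.asarray(params, dtype=float)
    base = evaluate_cohort(params)            # 1 expensive call
    grad = np.zeros_like(params)
    for i in range(len(params)):              # 15 more expensive calls
        perturbed = params.copy()
        perturbed[i] += eps
        grad[i] = (evaluate_cohort(perturbed) - base) / eps
    return grad
```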

I have developed a method to circumvent this problem and am currently evaluating it, but I am surprised that I have not found much in the literature regarding expensive cost function evaluations. This makes me wonder whether I am making the problem harder than it needs to be and whether there is a better way already available.

So my questions are basically these: Does anyone know of methods for optimizing cost functions, convex or not, when evaluation is slow? Or, am I doing something silly in the first place by rerunning the algorithm and comparing against the sample cohort so many times?

Best Answer

TL;DR

I recommend using LIPO. It is provably correct and provably better than pure random search (PRS). It is also extremely simple to implement and has no hyperparameters. I have not conducted an analysis that compares LIPO to BO, but my expectation is that the simplicity and efficiency of LIPO imply that it will outperform BO.

(See also: What are some of the disadvantages of Bayesian hyperparameter optimization?)

Bayesian Optimization

Bayesian Optimization-type methods build Gaussian process surrogate models to explore the parameter space. The main idea is that parameter tuples that are closer together will have similar function values, so the assumption of a covariance structure among points allows the algorithm to make educated guesses about which parameter tuple is most worthwhile to try next. This strategy helps to reduce the number of function evaluations; in fact, the motivation of BO methods is to keep the number of function evaluations as low as possible while "using the whole buffalo" to make good guesses about what point to test next. There are different figures of merit (expected improvement, expected quantile improvement, probability of improvement...) that are used to compare candidate points and decide which to visit next.

Contrast this to something like a grid search, which will never use any information from its previous function evaluations to inform where to go next.

Incidentally, this is also a powerful global optimization technique, and as such makes no assumptions about the convexity of the surface. Additionally, if the function is stochastic (say, evaluations have some inherent random noise), this can be directly accounted for in the GP model.

On the other hand, you'll have to fit at least one GP at every iteration (or several, picking the "best", or averaging over alternatives, or using fully Bayesian methods). The model is then used to make (probably thousands of) predictions, usually in the form of multistart local optimization, with the observation that it's much cheaper to evaluate the GP prediction function than the function under optimization. But even with this computational overhead, nonconvex functions can typically be optimized with a relatively small number of function calls.

A widely-cited paper on the topic is Jones et al (1998), "Efficient Global Optimization of Expensive Black-Box Functions." But there are many variations on this idea.
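To make this concrete, here is a minimal sketch using scikit-optimize's `gp_minimize`, which fits a GP surrogate and uses an acquisition function such as expected improvement to pick each next point. This is one of several available implementations, not the only option; the `cost` function and parameter bounds below are placeholders standing in for the expensive cohort evaluation and the real parameter ranges.

```python
from skopt import gp_minimize
from skopt.space import Real

def cost(params):
    # Stand-in for the expensive step: in the real problem this would
    # regenerate the 3D surfaces for `params` and score them against
    # the sample cohort.
    return sum((p - 0.3) ** 2 for p in params)

# Placeholder bounds for the 15 parameters.
space = [Real(-1.0, 1.0, name=f"p{i}") for i in range(15)]

result = gp_minimize(
    cost,                 # expensive black-box objective
    space,                # search space
    acq_func="EI",        # expected improvement acquisition function
    n_calls=60,           # total budget of expensive evaluations
    n_initial_points=15,  # random points used to seed the GP
    random_state=0,
)
print(result.x, result.fun)   # best parameter tuple and best cost found
```

The surrogate fitting and acquisition optimization are handled internally, so the only expensive work is the `n_calls` evaluations of `cost`.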

Random Search

Even when the cost function is expensive to evaluate, random search can still be useful. Random search is dirt-simple to implement. The only choice for a researcher to make is setting the probability $p$ that you want your results to lie in some quantile $q$; the rest proceeds automatically using results from basic probability.

Suppose your quantile is $q = 0.95$ and you want a $p=0.95$ probability that the model results are in the top $100\times (1-q)=5$ percent of all hyperparameter tuples. The probability that all $n$ attempted tuples are not in that window is $q^n = 0.95^n$ (because they are chosen independently at random from the same distribution), so the probability that at least one tuple lands in that region is $1 - 0.95^n$. Putting it all together, we have

$$ 1 - q^n \ge p \implies n \ge \frac{\log(1 - p)}{\log(q)} $$

which in our specific case yields $n \ge 59$.

This result is why most people recommend $n=60$ attempted tuples for random search. It's worth noting that $n=60$ is comparable to the number of experiments required to get good results with Gaussian process-based methods when there are a moderate number of parameters. Unlike Gaussian processes, the number of queried tuples does not change with the number of hyperparameters to search over; indeed, for a large number of hyperparameters, a Gaussian process-based method can take many iterations to make headway.

Since you have a probabilistic characterization of how good the results are, this result can be a persuasive tool to convince your boss that running additional experiments will yield diminishing marginal returns.
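Putting the budget calculation and the search together, a minimal sketch (with a toy `cost` standing in for the expensive cohort evaluation and placeholder bounds for the 15 parameters):

```python
import math
import numpy as np

p, q = 0.95, 0.95
# Smallest n with 1 - q**n >= p, i.e. n >= log(1 - p) / log(q); gives 59 here.
n = math.ceil(math.log(1 - p) / math.log(q))

def cost(params):
    # Stand-in for the expensive cohort evaluation.
    return float(np.sum((params - 0.3) ** 2))

rng = np.random.default_rng(0)
bounds = np.array([[-1.0, 1.0]] * 15)   # placeholder bounds for 15 parameters

# Draw n tuples uniformly at random from the box and keep the best one.
candidates = rng.uniform(bounds[:, 0], bounds[:, 1], size=(n, len(bounds)))
scores = [cost(c) for c in candidates]
best = candidates[int(np.argmin(scores))]
print(n, min(scores), best)
```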

LIPO and its Variants

This is an exciting arrival which, if it is not new, is certainly new to me. It proceeds by alternating between placing informed bounds on the function and sampling from the best of those bounds, combined with quadratic approximations. I'm still working through all the details, but I think this is very promising. There is a nice blog write-up, and the paper is Cédric Malherbe and Nicolas Vayatis, "Global optimization of Lipschitz functions."
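If you want to try it without implementing the paper yourself, dlib ships a LIPO-type optimizer (to my knowledge, MaxLIPO combined with a trust-region refinement) as `find_min_global`. A minimal sketch below uses the standard Holder table test function as a stand-in for an expensive objective; swap in your own function and bounds.

```python
import dlib
from math import sin, cos, exp, sqrt, pi

def holder_table(x0, x1):
    # Standard 2-D test function standing in for an expensive objective.
    return -abs(sin(x0) * cos(x1) * exp(abs(1 - sqrt(x0 * x0 + x1 * x1) / pi)))

# Search the box [-10, 10] x [-10, 10] with a budget of 80 evaluations.
x, y = dlib.find_min_global(holder_table, [-10, -10], [10, 10], 80)
print(x, y)   # global minimum is about -19.2085 near (8.055, 9.665)
```

Note that the only thing you choose is the evaluation budget, which fits the "no hyperparameters" claim above.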