Regression – Differences Between Linear Regression and Bayesian Optimization for Black Box Function Optimization

Tags: bayesian-optimization, machine-learning, regression

Goal: I have a function $f(x,y)=z$ (two variables for illustration only) about which I know almost nothing: it has a compact domain which I can determine, it is non-negative, and it is bounded above. My goal is to find the maximum value $f(x,y)$ takes on its domain. I can query the function, and I know its values at a few points: $f(x_1, y_1)=z_1, f(x_2, y_2)=z_2,…, f(x_n, y_n)=z_n$.

Optimization via Linear Regression:

I have used linear regression (and related methods like the lasso) to model functions like this in the past; this is the method I am most comfortable with. Typically, I might try to find a polynomial (of degree $k$) that models the data reasonably accurately.
The end result would be to say that:

$$f(x,y)\approx \beta_1 + \beta_2 x + \beta_3 y + \beta_4 x^2 + \beta_5 xy + \beta_6 y^2 + … + \beta_{m}y^{k}.$$

The coefficients are determined by solving the normal equations. The "best" degree for the polynomial might be chosen by cross-validation. Once the "best" polynomial has been determined, one can use any of a number of methods to optimize it (Newton's method, gradient descent, hill climbing, etc.). Then, if all goes well, the optimum of the polynomial corresponds to an optimal value of $f(x,y)$.
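For concreteness, here is a minimal sketch of that workflow in Python. The use of scikit-learn/SciPy, the degree range, and the random restarts are my own illustrative choices, not part of the original setup.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from scipy.optimize import minimize

def fit_best_polynomial(X, z, max_degree=5):
    """Pick the polynomial degree by cross-validation, then refit on all the data.

    X is an (n, 2) array of observed (x, y) points; z is the length-n array of values.
    """
    best_model, best_score = None, -np.inf
    for degree in range(1, max_degree + 1):
        model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
        score = cross_val_score(model, X, z, cv=5,
                                scoring="neg_mean_squared_error").mean()
        if score > best_score:
            best_model, best_score = model.fit(X, z), score
    return best_model

def maximize_surrogate(model, bounds, n_restarts=20, seed=None):
    """Maximize the fitted polynomial over the box `bounds` from several random starts."""
    rng = np.random.default_rng(seed)
    best_point, best_value = None, -np.inf
    for _ in range(n_restarts):
        x0 = np.array([rng.uniform(lo, hi) for lo, hi in bounds])
        # minimize the negative prediction, i.e. maximize the surrogate
        res = minimize(lambda p: -model.predict(p.reshape(1, -1))[0], x0, bounds=bounds)
        if -res.fun > best_value:
            best_point, best_value = res.x, -res.fun
    return best_point, best_value
```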

Bayesian Optimization:

Recently, I was describing this technique (linear regression) to a colleague who asked: "What is the difference between that and Bayesian optimization?" I've been attempting to answer this question for the last couple of days, and so far I don't have anything really conclusive.

After reading this blog, trying to read this one, and getting stuck reading this paper, I think I understand the following about BO:

  1. Start with a prior (a guess as to what family of functions the black box belongs to… or at least some family of functions which you believe might be able to describe $f$ well). It looks like using a Gaussian process as a prior is popular.

  2. Specify whatever is needed to build your prior (for GP this is a covariance function).

  3. As you go along, you maintain two evolving functions, a "mean" function $m(x,y)$ and a "confidence interval" function $c(x,y)$. Basically, you can predict the value at any point by evaluating $m(x,y)$ and you are confident that $f(x,y)\in [m(x,y)-c(x,y), m(x,y)+c(x,y)]$.

  4. Bayesian optimization provides a systematic way of updating/exploiting the model by alternately selecting points where the confidence interval is wide (you are very uncertain about the value) and points that look optimal under the model $m(x,y)$ (see the code sketch after this list).
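To make steps 1–4 concrete, here is a minimal sketch of a single BO step in Python, using a Gaussian process surrogate and an upper-confidence-bound acquisition rule. The scikit-learn classes, the RBF kernel, the candidate set, and the exploration weight `kappa` are illustrative assumptions, not the one canonical choice.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel

def bo_suggest(X_obs, z_obs, candidates, kappa=2.0):
    """One Bayesian-optimization step: fit the GP surrogate to the data seen so far,
    then pick the candidate that maximizes the upper confidence bound m + kappa * c."""
    kernel = ConstantKernel(1.0) * RBF(length_scale=1.0)
    gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True)
    gp.fit(X_obs, z_obs)

    mean, std = gp.predict(candidates, return_std=True)  # m(x, y) and the width c(x, y)
    ucb = mean + kappa * std  # large std -> explore, large mean -> exploit
    return candidates[np.argmax(ucb)]

# Usage: evaluate f at the suggested point, append it to (X_obs, z_obs), and repeat.
```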

My Question: Is it fair/correct to say that the major difference between Bayesian optimization and simply optimizing the linear regression model is that Bayesian optimization provides a systematic way of sampling (with the goal of optimizing the model and reducing uncertainty), whereas the linear regression approach ignores the idea of uncertainty and doesn't explicitly provide a means of updating the model?

Said another way, is the linear regression model an example of what might be a "mean" function $m(x,y)$ in Bayesian Optimization?

If I stick with linear regression, specify how to compute the uncertainty, and provide a means of updating/improving the model (exploit/explore), am I just doing Bayesian optimization (without calling it that)?

Background: My background is in pure mathematics (graph theory), but my new job has me working with a lot of data, so I'm teaching myself statistics. With that in mind, my statistics vocabulary is somewhat limited, so I appreciate answers that either avoid technical terminology or briefly explain any technical terminology used. For example, Monday was the first time I heard/read the word "prior" used in a stats sense, and I'm still not 100% sure I understand it correctly.

Best Answer

Your understanding is correct.

BO inherently measures the uncertainty of regions of your search space, and the acquisition function governs the tradeoff between exploring a point in a region with high uncertainty and exploiting a region with lower uncertainty but a higher predicted value.
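For concreteness, one common (but by no means the only) acquisition function is expected improvement. Writing $m(x,y)$ and $c(x,y)$ for the GP posterior mean and standard deviation, and $f^{+}$ for the best value observed so far, it has the closed form

$$\mathrm{EI}(x,y) = \bigl(m(x,y) - f^{+}\bigr)\,\Phi(u) + c(x,y)\,\phi(u), \qquad u = \frac{m(x,y) - f^{+}}{c(x,y)},$$

where $\Phi$ and $\phi$ are the standard normal CDF and density. The first term rewards points whose predicted value already beats $f^{+}$ (exploitation); the second rewards points where $c(x,y)$ is large (exploration).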

By contrast, vanilla regression models assume equal (homoskedastic) variance: while you can locate the maximum of a polynomial model within some box, the search will be excessively local and will not manage the exploration-exploitation tradeoff well.

But this just repeats what you already know.

Typical mean functions in BO (and in GPs generally) are either zero or another constant, with all of the heavy lifting done by the kernel function. This is mostly a computational convenience: in that case predictions can be made with straightforward linear algebra; otherwise, you have to resort to simulation.
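Concretely, assuming a zero prior mean and Gaussian observation noise with variance $\sigma^2$ (a standard setup, not something from the question), that linear algebra is the usual GP posterior. With kernel matrix $K_{ij}=k\bigl((x_i,y_i),(x_j,y_j)\bigr)$ over the observed points, $\mathbf{k}_*$ the vector of kernel values between a new point $(x,y)$ and the observed points, and $\mathbf{z}$ the vector of observed values,

$$m(x,y) = \mathbf{k}_*^{\top}(K+\sigma^2 I)^{-1}\mathbf{z}, \qquad c^2(x,y) = k\bigl((x,y),(x,y)\bigr) - \mathbf{k}_*^{\top}(K+\sigma^2 I)^{-1}\mathbf{k}_*.$$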

The Jones 1998 paper compares GP and polynomial regression on page 464. This is not strictly the same model that you propose (choosing polynomial terms by CV), but it's consistent with your aims.
