Regression – Differences Between Linear Regression and Bayesian Optimization for Black Box Function Optimization

Tags: bayesian-optimization, machine-learning, regression

Goal: I have a function $f(x,y)=z$ (two variables for illustration only) about which I know almost nothing: it has a compact domain which I can determine, it is non-negative, and it is bounded above. My goal is to find the maximum value $f(x,y)$ takes on its domain. I can query the function, and I know its values at a few points: $f(x_1, y_1)=z_1, f(x_2, y_2)=z_2,…, f(x_n, y_n)=z_n$.

Optimization via Linear Regression:

I have used linear regression (and related methods like the lasso) to model functions like this in the past; this is the method I am most comfortable with. Typically, I might try to find a polynomial (of degree $k$) that models the data reasonably accurately.
The end result would be to say that:

$$f(x,y)\approx \beta_1 + \beta_2 x + \beta_3 y + \beta_4 x^2 + \beta_5 xy + \beta_6 y^2 + … + \beta_{m}y^{k}.$$

The coefficients are determined by solving the normal equations. The "best" degree for the polynomial might be chosen by cross-validation. Once the "best" polynomial has been determined, one can use any of a number of methods to optimize it (Newton's method, gradient descent, hill climbing, etc.). Then, if all goes well, the optimum of the polynomial corresponds to an optimal value of $f(x,y)$.
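For concreteness, here is a minimal sketch of that workflow in Python. The use of scikit-learn/SciPy, the degree range, and the random restarts are my own illustrative choices, not part of the original setup.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from scipy.optimize import minimize

def fit_best_polynomial(X, z, max_degree=5):
    """Pick the polynomial degree by cross-validation, then refit on all the data.

    X is an (n, 2) array of observed (x, y) points; z is the length-n array of values.
    """
    best_model, best_score = None, -np.inf
    for degree in range(1, max_degree + 1):
        model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
        score = cross_val_score(model, X, z, cv=5,
                                scoring="neg_mean_squared_error").mean()
        if score > best_score:
            best_model, best_score = model.fit(X, z), score
    return best_model

def maximize_surrogate(model, bounds, n_restarts=20, seed=None):
    """Maximize the fitted polynomial over the box `bounds` from several random starts."""
    rng = np.random.default_rng(seed)
    best_point, best_value = None, -np.inf
    for _ in range(n_restarts):
        x0 = np.array([rng.uniform(lo, hi) for lo, hi in bounds])
        # minimize the negative prediction, i.e. maximize the surrogate
        res = minimize(lambda p: -model.predict(p.reshape(1, -1))[0], x0, bounds=bounds)
        if -res.fun > best_value:
            best_point, best_value = res.x, -res.fun
    return best_point, best_value
```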

Bayesian Optimization:

Recently, I was describing this technique (linear regression) to a colleague who asked: "What is the difference between that and Bayesian optimization?" I've been attempting to answer this question for the last couple of days, and so far I don't have anything really conclusive.

After reading this blog, trying to read this one, and getting stuck reading this paper, I think I understand the following about BO:

  1. Start with a prior (a guess as to what family of functions the black box belongs to… or at least some family of functions which you believe might be able to describe $f$ well). It looks like using a Gaussian process as a prior is popular.

  2. Specify whatever is needed to build your prior (for GP this is a covariance function).

  3. As you go along, you maintain two evolving functions, a "mean" function $m(x,y)$ and a "confidence interval" function $c(x,y)$. Basically, you can predict the value at any point by evaluating $m(x,y)$ and you are confident that $f(x,y)\in [m(x,y)-c(x,y), m(x,y)+c(x,y)]$.

  4. Bayesian optimization provides a systematic way of updating/exploiting the model by alternately selecting points where the confidence interval is wide (you are very uncertain about the value) and points that look optimal under the model $m(x,y)$ (see the code sketch after this list).
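To make steps 1–4 concrete, here is a minimal sketch of a single BO step in Python, using a Gaussian process surrogate and an upper-confidence-bound acquisition rule. The scikit-learn classes, the RBF kernel, the candidate set, and the exploration weight `kappa` are illustrative assumptions, not the one canonical choice.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel

def bo_suggest(X_obs, z_obs, candidates, kappa=2.0):
    """One Bayesian-optimization step: fit the GP surrogate to the data seen so far,
    then pick the candidate that maximizes the upper confidence bound m + kappa * c."""
    kernel = ConstantKernel(1.0) * RBF(length_scale=1.0)
    gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True)
    gp.fit(X_obs, z_obs)

    mean, std = gp.predict(candidates, return_std=True)  # m(x, y) and the width c(x, y)
    ucb = mean + kappa * std  # large std -> explore, large mean -> exploit
    return candidates[np.argmax(ucb)]

# Usage: evaluate f at the suggested point, append it to (X_obs, z_obs), and repeat.
```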

My Question: Is it fair/correct to say that the major difference between Bayesian optimization and simply optimizing the linear regression model is that Bayesian optimization provides a systematic way of sampling (with the goal of optimizing the model and reducing uncertainty), whereas the linear regression approach ignores the idea of uncertainty and doesn't explicitly provide a means of updating the model?

Said another way, is the linear regression model an example of what might be a "mean" function $m(x,y)$ in Bayesian Optimization?

If I stick with linear regression, specify how to compute the uncertainty, and provide a means of updating/improving the model (exploit/explore), am I just doing Bayesian optimization (without calling it that)?

Background: My background is in pure mathematics (graph theory), but my new job has me working with a lot of data, so I'm teaching myself statistics. With that in mind, my statistics vocabulary is somewhat limited, so I appreciate answers that either avoid technical terminology or briefly explain any technical terminology used. For example, Monday was the first time I heard/read the word "prior" used in a stats sense, and I'm still not 100% sure I understand it correctly.

Best Answer

Your understanding is correct.

BO inherently measures the uncertainty of regions of your search space, and the acquisition function governs the tradeoff between exploring a point in a region with high uncertainty and exploiting a region with lower uncertainty but a higher predicted value.
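For concreteness, one common (but by no means the only) acquisition function is expected improvement. Writing $m(x,y)$ and $c(x,y)$ for the GP posterior mean and standard deviation, and $f^{+}$ for the best value observed so far, it has the closed form

$$\mathrm{EI}(x,y) = \bigl(m(x,y) - f^{+}\bigr)\,\Phi(u) + c(x,y)\,\phi(u), \qquad u = \frac{m(x,y) - f^{+}}{c(x,y)},$$

where $\Phi$ and $\phi$ are the standard normal CDF and density. The first term rewards points whose predicted value already beats $f^{+}$ (exploitation); the second rewards points where $c(x,y)$ is large (exploration).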

By contrast, vanilla regression models assume equal (homoskedastic) variance: while you can locate the maximum of a polynomial model within some box, the search will be excessively local and will not manage the exploration-exploitation tradeoff well.

But this just repeats what you already know.

Typical mean functions in BO (and in GPs generally) are either zero or another constant, with all of the heavy lifting done by the kernel function. This is mostly a computational convenience: in that case predictions can be made with straightforward linear algebra; otherwise, you have to resort to simulation.
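Concretely, assuming a zero prior mean and Gaussian observation noise with variance $\sigma^2$ (a standard setup, not something from the question), that linear algebra is the usual GP posterior. With kernel matrix $K_{ij}=k\bigl((x_i,y_i),(x_j,y_j)\bigr)$ over the observed points, $\mathbf{k}_*$ the vector of kernel values between a new point $(x,y)$ and the observed points, and $\mathbf{z}$ the vector of observed values,

$$m(x,y) = \mathbf{k}_*^{\top}(K+\sigma^2 I)^{-1}\mathbf{z}, \qquad c^2(x,y) = k\bigl((x,y),(x,y)\bigr) - \mathbf{k}_*^{\top}(K+\sigma^2 I)^{-1}\mathbf{k}_*.$$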

The Jones 1998 paper compares GP and polynomial regression on page 464. This is not strictly the same model that you propose (choosing polynomial terms by CV), but it's consistent with your aims.
