Solved – Difference between Gaussian process regression and other regression techniques (say linear regression)

gaussian process, nonlinear regression, normal distribution, regression

I am confused about the differences in the regression techniques available.

Take, for example, linear regression. In this case, we construct a model $y = \beta^Tx + \epsilon$ where $\epsilon \sim N(0,\sigma^2)$. In a sense, $y$ viewed as a function of $x$ then becomes a "Gaussian process" whose mean function is $\beta^Tx$ and whose covariance function is $k(x,x')=\sigma^2 \mathbb{1}_{x = x'}$.
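As a quick sanity check of this reading (a minimal numpy sketch; the design points, $\beta$, and $\sigma$ below are made up purely for illustration), one can simulate many replicates of $y = \beta^Tx + \epsilon$ and confirm that the empirical mean and covariance of the vector of $y$'s match $\beta^Tx$ and $\sigma^2 I$:

    import numpy as np

    rng = np.random.default_rng(0)

    # Made-up design: 5 inputs in 2 dimensions, arbitrary beta and noise level.
    X = rng.normal(size=(5, 2))
    beta = np.array([1.5, -0.7])
    sigma = 0.3

    # Simulate many independent replicates of y = X beta + eps.
    n_rep = 200_000
    eps = rng.normal(scale=sigma, size=(n_rep, 5))
    Y = X @ beta + eps                                   # shape (n_rep, 5)

    print(np.allclose(Y.mean(axis=0), X @ beta, atol=0.01))    # mean ~ X beta
    print(np.allclose(np.cov(Y, rowvar=False),
                      sigma**2 * np.eye(5), atol=0.01))        # covariance ~ sigma^2 I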

On the other hand, Gaussian process regression (as in the Gaussian Processes for Machine Learning book) is modeled as $y \sim \mathcal{GP}(m(x),k(x,x'))$ for some mean function $m(x)$ and kernel/covariance function $k(x,x')$. This type of model is then used to interpolate a given set of data using basis functions that arise from the covariance function.

The main difference I see is that linear regression (or really, generalized regression of this form) creates a model that does not pass through the data points but rather finds the model that best fits the data. Of course, the predictor need not be linear. On the other hand, Gaussian process regression uses conditioning on Gaussian vectors to find a model that actually passes through the data points.
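For concreteness, by "conditioning on Gaussian vectors" I mean the standard Gaussian conditioning identity (as in the Gaussian Processes for Machine Learning book): with noise-free observations $f = (f(x_1),\dots,f(x_n))^T$ at training inputs $X$ and test inputs $X_*$,

$$f_* \mid f \sim \mathcal N\big(K(X_*,X)\,K(X,X)^{-1}f,\ K(X_*,X_*) - K(X_*,X)\,K(X,X)^{-1}K(X,X_*)\big),$$

so the posterior mean is a linear combination of kernel "basis functions" $k(\cdot, x_i)$, and at any training input it reproduces the observed value exactly, i.e. it interpolates.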

With this in mind:

  • What really is Gaussian process regression? Can linear regression with normally distributed $\epsilon$ still be considered Gaussian process regression, as opposed to the Gaussian process regression that interpolates the data (i.e. kriging)? I am confused because Wikipedia shows that Gaussian process regression need not interpolate the data points, as shown in the figure here: link.

Can someone help me clarify this confusion?

Best Answer

A Gaussian Process doesn't have to perfectly interpolate between points, as that Wikipedia link shows; it all depends on the covariance function that you use.

For example, consider the GP of the form $X \sim \mathcal N(0, \Sigma_{k_t})$, where $X$ is a vector of "dependent variables" and $\Sigma_{k_t}$ is a covariance matrix whose elements are $\Sigma_{ij} = k(t_i, t_j)$ for some kernel function $k$ evaluated at a set of points $t_i$ of the "independent variable" $t$.

If you specify a kernel with the following property: $\mathrm{Cor}(X_i, X_j) \to 1$ as $\|t_i - t_j\| \to 0$, notice that you are enforcing continuity. Hence, if you simply use such a kernel, for example the RBF, the resulting GP must pass through all the points, as there is no "noise" here at all.
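To make this concrete, here is a minimal scikit-learn sketch (the training points and length scale are arbitrary, chosen just for illustration): with a plain RBF kernel and essentially no observation noise, the posterior mean reproduces the training targets exactly.

    import numpy as np
    from sklearn.gaussian_process import GaussianProcessRegressor
    from sklearn.gaussian_process.kernels import RBF

    # Arbitrary 1-D training data.
    t = np.array([[0.0], [1.0], [2.5], [4.0]])
    y = np.array([0.2, -0.5, 1.3, 0.7])

    # Pure RBF kernel, only a tiny jitter (alpha) for numerical stability,
    # and no hyperparameter optimisation so the kernel stays as specified.
    gp = GaussianProcessRegressor(kernel=RBF(length_scale=1.0),
                                  alpha=1e-10, optimizer=None)
    gp.fit(t, y)

    # The posterior mean passes through every training point.
    print(np.allclose(gp.predict(t), y))   # True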

Instead, if you decide to specify a kernel that does account for noise, for example $k(t_i, t_j) = \mathrm{RBF}(t_i, t_j) + \sigma^2 \mathbb{1}(t_i = t_j)$ (the noise term is the WhiteKernel in scikit-learn, also known as the white noise kernel), then notice that, even if two $t$s are arbitrarily close but distinct, their correlation isn't 1, i.e. there is some noise here. So the function is not expected to be continuous, and the GP is not forced to pass through the observations.
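By contrast, with the same made-up data as above, adding the WhiteKernel term means the posterior mean no longer reproduces the training targets:

    import numpy as np
    from sklearn.gaussian_process import GaussianProcessRegressor
    from sklearn.gaussian_process.kernels import RBF, WhiteKernel

    t = np.array([[0.0], [1.0], [2.5], [4.0]])
    y = np.array([0.2, -0.5, 1.3, 0.7])

    # RBF plus an explicit white-noise term; hyperparameters kept fixed.
    kernel = RBF(length_scale=1.0) + WhiteKernel(noise_level=0.1)
    gp = GaussianProcessRegressor(kernel=kernel, optimizer=None)
    gp.fit(t, y)

    # In scikit-learn the WhiteKernel only enters the training covariance,
    # not the cross-covariance used for prediction, so predictions at the
    # training inputs are shrunk towards the prior mean and no longer match y.
    print(gp.predict(t))
    print(np.allclose(gp.predict(t), y))   # False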

In fact, you can interpret a GP with such a kernel as the traditional smooth RBF GP with a noise term added on top:

$$X \sim \mathcal N(0,\ \Sigma_{RBF} + \sigma^2 \mathcal I) \stackrel{d}{=} \mathcal N(0,\ \Sigma_{RBF}) + \mathcal N(0,\ \sigma^2 \mathcal I) \quad\Longrightarrow\quad X \stackrel{d}{=} \bar X + \epsilon$$

... where $\bar X \sim \mathcal N(0, \Sigma_{RBF})$ is now a continuous GP and $\epsilon \sim \mathcal N(0, \sigma^2 \mathcal I)$ is independent noise. Notice how similar this is to the linear regression equation: the only real difference is that you're replacing the mean of the linear regression (which is a parametric line) with a non-parametric GP.
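Here is a small numpy sketch of that decomposition (kernel, input points, and $\sigma^2$ are again arbitrary): drawing a smooth RBF sample and adding independent white noise gives the same distribution as sampling directly from $\mathcal N(0, \Sigma_{RBF} + \sigma^2 \mathcal I)$, which you can check by comparing empirical covariances.

    import numpy as np

    rng = np.random.default_rng(0)

    # Arbitrary input points, RBF covariance matrix, and noise variance.
    t = np.linspace(0, 4, 6)
    Sigma_rbf = np.exp(-0.5 * (t[:, None] - t[None, :]) ** 2)
    sigma2 = 0.1
    n = len(t)

    n_rep = 200_000
    # Direct draws from N(0, Sigma_rbf + sigma^2 I) ...
    direct = rng.multivariate_normal(np.zeros(n), Sigma_rbf + sigma2 * np.eye(n),
                                     size=n_rep)
    # ... versus a smooth GP draw plus independent white noise.
    smooth = rng.multivariate_normal(np.zeros(n), Sigma_rbf, size=n_rep)
    noisy = smooth + rng.normal(scale=np.sqrt(sigma2), size=(n_rep, n))

    # The two empirical covariance matrices agree up to Monte Carlo error.
    print(np.allclose(np.cov(direct, rowvar=False),
                      np.cov(noisy, rowvar=False), atol=0.02))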
