Gaussian Processes – Understanding the Benefits of Gaussian Processes


I am confused about the benefits of Gaussian processes compared to simple linear regression, where we explicitly define that a linear function models the data.

In Gaussian processes, however, we define a distribution over functions, meaning we don't specify that the function should be linear. Instead we place a Gaussian prior over functions, which encodes properties such as how smooth the function should be.

So we don't have to explicitly define what the model should be. However, I have questions. We do have the marginal likelihood, and using it we can tune the covariance function parameters of the Gaussian prior. Isn't this similar to defining what type of function it should be?

It boils down to the same thing: defining the parameters, even though in a GP they are hyperparameters. For example, in this paper the mean function of the GP is defined as

$$m(x) = ax^2 + bx + c, \quad \text{i.e. a second-order polynomial.}$$

So the model/function is definitely defined, isn't it? What is the difference from defining the function to be linear, as in linear regression?

I just don't get what the benefit of using a GP is.

Best Answer

Let's recall some formulas for Gaussian process regression. Suppose that we have a sample $D = (X,\mathbf{y}) = \{(\mathbf{x}_i, y_i)\}_{i = 1}^N$. For this sample the log-likelihood has the form
$$
L = -\frac12 \left( \log |K| + \mathbf{y}^T K^{-1} \mathbf{y}\right),
$$
where $K = \{k(\mathbf{x}_i, \mathbf{x}_j)\}_{i, j = 1}^N$ is the sample covariance matrix and $k(\mathbf{x}_i, \mathbf{x}_j)$ is a covariance function whose parameters we tune by maximizing the log-likelihood. The prediction (posterior mean) for a new point $\mathbf{x}$ has the form
$$
\hat{y}(\mathbf{x}) = \mathbf{k} K^{-1} \mathbf{y},
$$
where $\mathbf{k} = \{k(\mathbf{x}, \mathbf{x}_i)\}_{i = 1}^N$ is the vector of covariances between the new point and the sample points.
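Here is a minimal NumPy sketch of these two formulas on a made-up 1-D sample. The particular covariance function (a squared exponential with unit lengthscale), the toy data, and the small jitter added to $K$ are all illustrative assumptions, not part of the original answer:

```python
import numpy as np

def k(a, b):
    # Covariance between two sets of 1-D inputs, shape (len(a), len(b));
    # an illustrative squared exponential kernel with unit lengthscale.
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2)

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, 20)                      # sample inputs x_1 ... x_N
y = np.sin(x) + 0.1 * rng.standard_normal(20)   # sample targets y_1 ... y_N

K = k(x, x) + 1e-6 * np.eye(len(x))             # sample covariance matrix (jitter keeps it invertible)
alpha = np.linalg.solve(K, y)                   # K^{-1} y

# Log-likelihood L = -1/2 (log|K| + y^T K^{-1} y)
_, logdet = np.linalg.slogdet(K)
L = -0.5 * (logdet + y @ alpha)

# Posterior mean at new points: y_hat(x) = k K^{-1} y
x_new = np.linspace(-3, 3, 5)
y_hat = k(x_new, x) @ alpha
print(L, y_hat)
```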

Now note that Gaussian process regression can reproduce the linear model exactly. Suppose the covariance function has the form $k(\mathbf{x}_i, \mathbf{x}_j) = \mathbf{x}_i^T \mathbf{x}_j$. In this case the prediction is
$$
\hat{y}(\mathbf{x}) = \mathbf{x}^T X^T (X X^T)^{-1} \mathbf{y} = \mathbf{x}^T (X^T X)^{-1} X^T \mathbf{y}.
$$
The identity holds only when $X X^T$ is nonsingular, which it typically is not (for $N > d$ it is rank-deficient), but this is not a problem once we regularize the covariance matrix. The right-hand side is exactly the ordinary least squares formula, so we can do linear regression with Gaussian processes by using the appropriate covariance function.
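The following sketch checks this numerically. The data, the regularization constant, and the test point are arbitrary choices for illustration; with the linear kernel plus a tiny regularization the GP posterior mean is ridge regression, which approaches ordinary least squares as the regularization goes to zero:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((50, 3))                 # N = 50 points, d = 3 features
w_true = np.array([2.0, -1.0, 0.5])
y = X @ w_true + 0.05 * rng.standard_normal(50)

lam = 1e-8                                       # small regularization keeps X X^T invertible
x_new = np.array([0.3, -0.7, 1.2])

# GP prediction with linear covariance: x^T X^T (X X^T + lam I)^{-1} y
K = X @ X.T + lam * np.eye(len(X))
k_vec = X @ x_new                                # covariances between x_new and the sample points
gp_pred = k_vec @ np.linalg.solve(K, y)

# Ordinary least squares: x^T (X^T X)^{-1} X^T y
w_ols = np.linalg.solve(X.T @ X, X.T @ y)
ols_pred = x_new @ w_ols

print(gp_pred, ols_pred)                         # the two predictions agree up to the tiny regularization
```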

Now let's consider Gaussian process regression with another covariance function, for example the squared exponential covariance function $\exp \left( -(\mathbf{x}_i - \mathbf{x}_j)^T A^{-1} (\mathbf{x}_i - \mathbf{x}_j) \right)$, where $A$ is a matrix of hyperparameters that we tune. In this case the posterior mean is clearly not a linear function of $\mathbf{x}$.
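A small sketch of this case, including the hyperparameter tuning mentioned in the question: here $A$ is assumed to be a single lengthscale times the identity, and a simple grid search over the log-likelihood stands in for a proper optimizer; the sinusoidal toy data is again just an illustration:

```python
import numpy as np

def se_kernel(A, B, lengthscale):
    # Squared exponential covariance with A = lengthscale^2 * I
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-sq / lengthscale**2)

rng = np.random.default_rng(2)
X = rng.uniform(-3, 3, (30, 1))
y = np.sin(2 * X[:, 0]) + 0.1 * rng.standard_normal(30)

def log_likelihood(lengthscale):
    K = se_kernel(X, X, lengthscale) + 1e-6 * np.eye(len(X))
    _, logdet = np.linalg.slogdet(K)
    return -0.5 * (logdet + y @ np.linalg.solve(K, y))

# Tune the lengthscale by maximizing the log-likelihood over a grid
grid = np.logspace(-1, 1, 50)
best = grid[np.argmax([log_likelihood(l) for l in grid])]

# Posterior mean with the tuned lengthscale: clearly not a linear function of x
X_new = np.linspace(-3, 3, 7)[:, None]
K = se_kernel(X, X, best) + 1e-6 * np.eye(len(X))
y_hat = se_kernel(X_new, X, best) @ np.linalg.solve(K, y)
print(best, y_hat)
```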


So, the benefit is that we can model nonlinear functions by choosing a suitable covariance function (in most cases the squared exponential covariance function is a rather good choice). The source of the nonlinearity is not the trend (mean) component you mentioned, but the covariance function.
