Solved – Gaussian process and Correlation

correlation, covariance, gaussian process, machine learning

I have been wondering why people use Gaussian processes (GP) to model an unknown (sometimes deterministic) function. For instance, consider an unknown function $y=f(x)$. We have three independent observations from this function:
$$(x_1,y_1),\ (x_2,y_2),\ (x_3,y_3).$$

To learn the underlying function, the GP is a common non-parametric technique that treats the outputs as jointly multivariate normal. Assume a specific covariance function $K(x_i,x_j)$ and write $$\mathbf{y}=(y_1,y_2,y_3), \quad \mathbf{X}=(x_1,x_2,x_3).$$
The GP takes the following form
$$\mathbf{y}\mid\mathbf{X} \sim N\left(\mathbf{0},\begin{bmatrix}
K(x_1,x_1) & K(x_1,x_2) & K(x_1,x_3) \\
K(x_1,x_2) & K(x_2,x_2) & K(x_2,x_3) \\
K(x_1,x_3) & K(x_2,x_3) & K(x_3,x_3)
\end{bmatrix}\right)$$
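For concreteness, here is a minimal sketch (Python with NumPy) of what that covariance matrix looks like under a squared-exponential kernel; the input locations and the length-scale parameter `gamma` are made-up values for illustration, not anything from the question.

```python
import numpy as np

# Squared-exponential (Gaussian) kernel; gamma is an arbitrary choice here.
def k(x, x_prime, gamma=1.0):
    return np.exp(-gamma * (x - x_prime) ** 2)

# Three made-up input locations standing in for x_1, x_2, x_3.
X = np.array([0.0, 0.5, 2.0])

# The 3x3 matrix of K(x_i, x_j) values from the display above.
K = np.array([[k(xi, xj) for xj in X] for xi in X])
print(K)
# Nearby inputs (0.0 and 0.5) get covariance close to 1; distant ones near 0.
```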

The observations $(x_i,y_i)$ are independent. Their only commonality is that they come from the same underlying function.

My main question is: why are we forcing $(x_i,y_i)$ and $(x_l,y_l)$ to be correlated? Isn't that the wrong model? Why can we assume that we will get good prediction results for any $y \mid x$?

I am not sure what aspect of this problem I am missing, or why forcing correlation helps.

Best Answer

Choosing a kernel is equivalent to choosing a class of functions from which you will choose your model. If choosing a kernel feels like a big thing that encodes a lot of assumptions, that's because it is! People new to the field often don't think much about the choice of kernel and just go with the Gaussian kernel even if it's not appropriate.

How do we decide whether or not a kernel seems appropriate? We need to think about what the functions in the corresponding function space look like. The Gaussian kernel corresponds to very smooth functions, and when that kernel is chosen the assumption is being made that smooth functions will provide a decent model. That's not always the case, and there are tons of other kernels that encode different assumptions about what you want your function class to look like. There are kernels for modeling periodic functions, non-stationary kernels, and a whole host of other things. For example, the smoothness assumption encoded by the Gaussian kernel is not appropriate for text classification, as shown by Charles Martin in his blog here.

Let's look at examples of functions from spaces corresponding to two different kernels. The first will be the Gaussian kernel $k_1(x, x') = \exp(-\gamma |x - x'|^2)$ and the other will be the Brownian motion kernel $k_2(x, x') = \min \{x, x'\}$. A single random draw from each space looks like the following:

[Figure: a single random draw from the Gaussian kernel $k_1$]

[Figure: a single random draw from the Brownian motion kernel $k_2$]
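If you want to produce draws like these yourself, here is a rough sketch of sampling once from each zero-mean prior; the grid, the value of $\gamma$, and the jitter term are arbitrary choices, not anything taken from the figures above.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x = np.linspace(0.01, 1, 200)  # start above 0 so the Brownian kernel matrix stays positive definite

# Gaussian (squared-exponential) kernel k1 and Brownian motion kernel k2 on the grid.
K1 = np.exp(-50.0 * (x[:, None] - x[None, :]) ** 2)   # gamma = 50 is an arbitrary choice
K2 = np.minimum(x[:, None], x[None, :])

# One random draw from each zero-mean prior; the jitter keeps the Cholesky factor stable.
jitter = 1e-6 * np.eye(len(x))
f1 = np.linalg.cholesky(K1 + jitter) @ rng.standard_normal(len(x))
f2 = np.linalg.cholesky(K2 + jitter) @ rng.standard_normal(len(x))

plt.plot(x, f1, label="Gaussian kernel draw (smooth)")
plt.plot(x, f2, label="Brownian motion kernel draw (rough)")
plt.legend()
plt.show()
```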

Clearly these represent very different assumptions about what a good model is.

Also, note that we aren't necessarily forcing correlation. Take your mean function to be $\mu(x) = x^T \beta$ and your covariance function to be $k(x_i, x_j) = \sigma^2 \mathbf 1(i = j)$. Now our model is $$ Y | X \sim \mathcal N(X\beta, \sigma^2 I) $$ i.e. we've just recovered linear regression.
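As a quick sanity check on that special case, here is a small sketch; the design matrix, $\beta$, and $\sigma^2$ below are made-up values.

```python
import numpy as np

# A "white noise" kernel: k(x_i, x_j) = sigma^2 * 1(i == j).
# It depends only on whether the indices match, so no correlation is induced.
sigma2 = 0.5
n = 4
K = sigma2 * np.eye(n)                            # the Gram matrix is just sigma^2 * I

X = np.column_stack([np.ones(n), np.arange(n)])   # made-up design matrix with an intercept
beta = np.array([1.0, 2.0])                       # made-up coefficients

# Sampling Y | X from N(X beta, sigma^2 I) is exactly the linear regression model.
rng = np.random.default_rng(1)
y = rng.multivariate_normal(X @ beta, K)
print(y)
```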

But in general this correlation between nearby points is an extremely useful and powerful model. Imagine that you own an oil drilling company and you want to find new oil reserves. Drilling is extremely expensive, so you want to drill as few times as possible. Say we've drilled $n=5$ holes and we want to know where our next hole should be. We can imagine that the amount of oil in the earth's crust varies smoothly, so we model the amount of oil over the entire area we're considering with a Gaussian process using the Gaussian kernel: this is how we say that places very close together will have very similar amounts of oil, while places very far apart are effectively independent. The Gaussian kernel is also stationary, which is reasonable in this case: stationarity says that the correlation between two points depends only on the distance between them. We can then use our model to predict where we should drill next. We've just done a single step of Bayesian optimization, and I find this to be a very good way to intuitively appreciate why we like the correlation aspect of GPs.
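Here is a rough sketch of what that single step might look like in code; the drilled locations, oil amounts, kernel parameter, noise level, and the upper-confidence-bound rule for picking the next site are all illustrative choices of mine, not a prescription from the answer.

```python
import numpy as np

# One step of the drilling thought experiment: condition a GP with a Gaussian
# kernel on n = 5 "drilled" sites, then pick the next site by an upper
# confidence bound. All numbers below are made up.

def rbf(a, b, gamma=2.0):
    return np.exp(-gamma * (a[:, None] - b[None, :]) ** 2)

x_drilled = np.array([0.1, 0.3, 0.45, 0.7, 0.9])   # where we've already drilled
y_oil = np.array([1.2, 2.0, 1.8, 0.4, 0.9])        # oil found at each site
noise = 1e-4

x_grid = np.linspace(0, 1, 500)                    # candidate drilling sites

K = rbf(x_drilled, x_drilled) + noise * np.eye(5)
K_star = rbf(x_grid, x_drilled)

# Standard GP posterior mean and variance (zero prior mean for simplicity).
alpha = np.linalg.solve(K, y_oil)
mu = K_star @ alpha
v = np.linalg.solve(K, K_star.T)
var = 1.0 - np.sum(K_star * v.T, axis=1)           # prior variance is rbf(x, x) = 1

# Upper confidence bound: trade off "looks promising" against "still uncertain".
kappa = 2.0
ucb = mu + kappa * np.sqrt(np.clip(var, 0, None))
print("next drilling site:", x_grid[np.argmax(ucb)])
```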

Another good resource is Jones et al. (1998). They don't call their model a Gaussian process, but it is. This paper gives a very good feel for why we want to use the correlation between nearby points even in a deterministic setting.

One final point: I don't think anyone ever assumes that we can get good prediction results. That's something we'd want to verify, such as by cross validation.

Update

I want to clarify the nature of the correlation that we're modeling. First let's consider linear regression so $Y | X \sim \mathcal N(X\beta, \sigma^2 I)$. Under this model we have $Y_i \perp Y_j | X$ for $i \neq j$. But we also know that if $||x_1 - x_2||^2 < \varepsilon$ then $$(E(Y_1 | X) - E(Y_2 | X))^2 = (x_1^T \beta - x_2^T \beta)^2 = \langle x_1 - x_2, \beta \rangle^2 \leq || x_1 - x_2||^2 ||\beta ||^2 < \varepsilon ||\beta ||^2. $$

So this tells us that if inputs $x_1$ and $x_2$ are very close then the means of $Y_1$ and $Y_2$ are very close. This is different from being correlated because they're still independent, as evidenced by how $$P(Y_1 > E(Y_1 | X) \ \vert \ Y_2 > E(Y_2 | X)) = P(Y_1 > E(Y_1 | X)).$$

If they were correlated then knowing that $Y_2$ is above its mean would tell us something about $Y_1$.

So now let's keep $\mu(x) = x^T \beta$ but we'll add correlation by $Cov(Y_i, Y_j) = k(x_i, x_j)$. We still have the same result that $||x_1 - x_2||^2 < \varepsilon \implies (E(Y_1 | X) - E(Y_2 | X))^2$ is small, but now we've gained the fact that if $Y_1$ is larger than its mean, say, then likely $Y_2$ will be too. This is the correlation that we've added.
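A quick simulation makes that contrast concrete; the inputs, $\beta$, $\sigma^2$, and the RBF correlation below are made-up numbers chosen only to illustrate the two cases.

```python
import numpy as np

# Compare P(Y1 > E Y1 | Y2 > E Y2) for two nearby inputs under
# (a) independent noise and (b) an RBF covariance.

rng = np.random.default_rng(0)
x1, x2 = 1.00, 1.01                  # very close inputs
beta, sigma2 = 2.0, 1.0
mean = beta * np.array([x1, x2])     # E(Y | X) under the linear mean function

def conditional_prob(cov, n=200_000):
    y = rng.multivariate_normal(mean, cov, size=n)
    above2 = y[:, 1] > mean[1]
    return np.mean(y[above2, 0] > mean[0])

cov_indep = sigma2 * np.eye(2)                     # k(x_i, x_j) = sigma^2 1(i = j)
r = np.exp(-5.0 * (x1 - x2) ** 2)                  # RBF correlation, nearly 1 here
cov_rbf = sigma2 * np.array([[1.0, r], [r, 1.0]])

print("independent:", conditional_prob(cov_indep))  # ~ 0.5: Y2 tells us nothing about Y1
print("correlated: ", conditional_prob(cov_rbf))    # close to 1: Y2 above its mean pulls Y1 up too
```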
