Solved – Gaussian process and Correlation

correlation, covariance, gaussian process, machine learning

I have been wondering why people use Gaussian processes (GP) to model an unknown (sometimes deterministic) function. For instance, consider an unknown function $y=f(x)$. We have three independent observations from this function:
$$(x_1,y_1),\ (x_2,y_2),\ (x_3,y_3).$$

To learn the underlying function, the GP is a common non-parametric technique that treats the outputs as jointly multivariate normal. Assume a specific covariance function $K(x_i,x_j)$ and write $$\mathbf{y}=(y_1,y_2,y_3), \quad \mathbf{X}=(x_1,x_2,x_3).$$
The GP takes the following form
$$\mathbf{y}\mid\mathbf{X} \sim N\left(\mathbf{0},\begin{bmatrix}
K(x_1,x_1) & K(x_1,x_2) & K(x_1,x_3) \\
K(x_1,x_2) & K(x_2,x_2) & K(x_2,x_3) \\
K(x_1,x_3) & K(x_2,x_3) & K(x_3,x_3)
\end{bmatrix}\right)$$
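For concreteness, here is a minimal sketch (Python with NumPy) of what that covariance matrix looks like under a squared-exponential kernel; the input locations and the length-scale parameter `gamma` are made-up values for illustration, not anything from the question.

```python
import numpy as np

# Squared-exponential (Gaussian) kernel; gamma is an arbitrary choice here.
def k(x, x_prime, gamma=1.0):
    return np.exp(-gamma * (x - x_prime) ** 2)

# Three made-up input locations standing in for x_1, x_2, x_3.
X = np.array([0.0, 0.5, 2.0])

# The 3x3 matrix of K(x_i, x_j) values from the display above.
K = np.array([[k(xi, xj) for xj in X] for xi in X])
print(K)
# Nearby inputs (0.0 and 0.5) get covariance close to 1; distant ones near 0.
```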

The observations $(x_i,y_i)$ are independent. Their only commonality is that they come from the same underlying function.

My main question is: why are we forcing $(x_i,y_i)$ and $(x_l,y_l)$ to be correlated? Isn't that the wrong model? Why can we assume that we will get good prediction results for any $y \mid x$?

I am not sure what aspect of this problem I am missing, or why forcing correlation helps.

Best Answer

Choosing a kernel is equivalent to choosing a class of functions from which you will choose your model. If choosing a kernel feels like a big thing that encodes a lot of assumptions, that's because it is! People new to the field often don't think much about the choice of kernel and just go with the Gaussian kernel even if it's not appropriate.

How do we decide whether or not a kernel seems appropriate? We need to think about what the functions in the corresponding function space look like. The Gaussian kernel corresponds to very smooth functions, and when that kernel is chosen the assumption is being made that smooth functions will provide a decent model. That's not always the case, and there are tons of other kernels that encode different assumptions about what you want your function class to look like. There are kernels for modeling periodic functions, non-stationary kernels, and a whole host of other things. For example, the smoothness assumption encoded by the Gaussian kernel is not appropriate for text classification, as shown by Charles Martin in his blog here.

Let's look at examples of functions from spaces corresponding to two different kernels. The first will be the Gaussian kernel $k_1(x, x') = \exp(-\gamma |x - x'|^2)$ and the other will be the Brownian motion kernel $k_2(x, x') = \min \{x, x'\}$. A single random draw from each space looks like the following:

[Figure: a single random draw from the Gaussian kernel $k_1$]

[Figure: a single random draw from the Brownian motion kernel $k_2$]
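If you want to produce draws like these yourself, here is a rough sketch of sampling once from each zero-mean prior; the grid, the value of $\gamma$, and the jitter term are arbitrary choices, not anything taken from the figures above.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x = np.linspace(0.01, 1, 200)  # start above 0 so the Brownian kernel matrix stays positive definite

# Gaussian (squared-exponential) kernel k1 and Brownian motion kernel k2 on the grid.
K1 = np.exp(-50.0 * (x[:, None] - x[None, :]) ** 2)   # gamma = 50 is an arbitrary choice
K2 = np.minimum(x[:, None], x[None, :])

# One random draw from each zero-mean prior; the jitter keeps the Cholesky factor stable.
jitter = 1e-6 * np.eye(len(x))
f1 = np.linalg.cholesky(K1 + jitter) @ rng.standard_normal(len(x))
f2 = np.linalg.cholesky(K2 + jitter) @ rng.standard_normal(len(x))

plt.plot(x, f1, label="Gaussian kernel draw (smooth)")
plt.plot(x, f2, label="Brownian motion kernel draw (rough)")
plt.legend()
plt.show()
```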

Clearly these represent very different assumptions about what a good model is.

Also, note that we aren't necessarily forcing correlation. Take your mean function to be $\mu(x) = x^T \beta$ and your covariance function to be $k(x_i, x_j) = \sigma^2 \mathbf 1(i = j)$. Now our model is $$ Y | X \sim \mathcal N(X\beta, \sigma^2 I) $$ i.e. we've just recovered linear regression.
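As a quick sanity check on that special case, here is a small sketch; the design matrix, $\beta$, and $\sigma^2$ below are made-up values.

```python
import numpy as np

# A "white noise" kernel: k(x_i, x_j) = sigma^2 * 1(i == j).
# It depends only on whether the indices match, so no correlation is induced.
sigma2 = 0.5
n = 4
K = sigma2 * np.eye(n)                            # the Gram matrix is just sigma^2 * I

X = np.column_stack([np.ones(n), np.arange(n)])   # made-up design matrix with an intercept
beta = np.array([1.0, 2.0])                       # made-up coefficients

# Sampling Y | X from N(X beta, sigma^2 I) is exactly the linear regression model.
rng = np.random.default_rng(1)
y = rng.multivariate_normal(X @ beta, K)
print(y)
```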

But in general this correlation between nearby points is an extremely useful and powerful model. Imagine that you own an oil drilling company and you want to find new oil reserves. Drilling is extremely expensive, so you want to drill as few times as possible. Say we've drilled $n=5$ holes and we want to know where our next hole should be. We can imagine that the amount of oil in the earth's crust varies smoothly, so we model the amount of oil over the entire area we're considering with a Gaussian process using the Gaussian kernel: this is how we say that places very close together will have very similar amounts of oil, while places very far apart are effectively independent. The Gaussian kernel is also stationary, which is reasonable in this case: stationarity says that the correlation between two points depends only on the distance between them. We can then use our model to predict where we should drill next. We've just done a single step of Bayesian optimization, and I find this to be a very good way to intuitively appreciate why we like the correlation aspect of GPs.
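Here is a rough sketch of what that single step might look like in code; the drilled locations, oil amounts, kernel parameter, noise level, and the upper-confidence-bound rule for picking the next site are all illustrative choices of mine, not a prescription from the answer.

```python
import numpy as np

# One step of the drilling thought experiment: condition a GP with a Gaussian
# kernel on n = 5 "drilled" sites, then pick the next site by an upper
# confidence bound. All numbers below are made up.

def rbf(a, b, gamma=2.0):
    return np.exp(-gamma * (a[:, None] - b[None, :]) ** 2)

x_drilled = np.array([0.1, 0.3, 0.45, 0.7, 0.9])   # where we've already drilled
y_oil = np.array([1.2, 2.0, 1.8, 0.4, 0.9])        # oil found at each site
noise = 1e-4

x_grid = np.linspace(0, 1, 500)                    # candidate drilling sites

K = rbf(x_drilled, x_drilled) + noise * np.eye(5)
K_star = rbf(x_grid, x_drilled)

# Standard GP posterior mean and variance (zero prior mean for simplicity).
alpha = np.linalg.solve(K, y_oil)
mu = K_star @ alpha
v = np.linalg.solve(K, K_star.T)
var = 1.0 - np.sum(K_star * v.T, axis=1)           # prior variance is rbf(x, x) = 1

# Upper confidence bound: trade off "looks promising" against "still uncertain".
kappa = 2.0
ucb = mu + kappa * np.sqrt(np.clip(var, 0, None))
print("next drilling site:", x_grid[np.argmax(ucb)])
```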

Another good resource is Jones et al. (1998). They don't call their model a Gaussian process, but it is. This paper gives a very good feel for why we want to use the correlation between nearby points even in a deterministic setting.

One final point: I don't think anyone ever assumes that we can get good prediction results. That's something we'd want to verify, such as by cross validation.

Update

I want to clarify the nature of the correlation that we're modeling. First let's consider linear regression so $Y | X \sim \mathcal N(X\beta, \sigma^2 I)$. Under this model we have $Y_i \perp Y_j | X$ for $i \neq j$. But we also know that if $||x_1 - x_2||^2 < \varepsilon$ then $$(E(Y_1 | X) - E(Y_2 | X))^2 = (x_1^T \beta - x_2^T \beta)^2 = \langle x_1 - x_2, \beta \rangle^2 \leq || x_1 - x_2||^2 ||\beta ||^2 < \varepsilon ||\beta ||^2. $$

So this tells us that if inputs $x_1$ and $x_2$ are very close then the means of $Y_1$ and $Y_2$ are very close. This is different from being correlated because they're still independent, as evidenced by how $$P(Y_1 > E(Y_1 | X) \ \vert \ Y_2 > E(Y_2 | X)) = P(Y_1 > E(Y_1 | X)).$$

If they were correlated then knowing that $Y_2$ is above its mean would tell us something about $Y_1$.

So now let's keep $\mu(x) = x^T \beta$ but we'll add correlation by $Cov(Y_i, Y_j) = k(x_i, x_j)$. We still have the same result that $||x_1 - x_2||^2 < \varepsilon \implies (E(Y_1 | X) - E(Y_2 | X))^2$ is small, but now we've gained the fact that if $Y_1$ is larger than its mean, say, then likely $Y_2$ will be too. This is the correlation that we've added.
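A quick simulation makes that contrast concrete; the inputs, $\beta$, $\sigma^2$, and the RBF correlation below are made-up numbers chosen only to illustrate the two cases.

```python
import numpy as np

# Compare P(Y1 > E Y1 | Y2 > E Y2) for two nearby inputs under
# (a) independent noise and (b) an RBF covariance.

rng = np.random.default_rng(0)
x1, x2 = 1.00, 1.01                  # very close inputs
beta, sigma2 = 2.0, 1.0
mean = beta * np.array([x1, x2])     # E(Y | X) under the linear mean function

def conditional_prob(cov, n=200_000):
    y = rng.multivariate_normal(mean, cov, size=n)
    above2 = y[:, 1] > mean[1]
    return np.mean(y[above2, 0] > mean[0])

cov_indep = sigma2 * np.eye(2)                     # k(x_i, x_j) = sigma^2 1(i = j)
r = np.exp(-5.0 * (x1 - x2) ** 2)                  # RBF correlation, nearly 1 here
cov_rbf = sigma2 * np.array([[1.0, r], [r, 1.0]])

print("independent:", conditional_prob(cov_indep))  # ~ 0.5: Y2 tells us nothing about Y1
print("correlated: ", conditional_prob(cov_rbf))    # close to 1: Y2 above its mean pulls Y1 up too
```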
