Gaussian Process Analysis – Why Compute the Covariance Function for Inputs?

gaussian process, normal distribution

We have the following GP regression model:
\begin{equation}
y_i = f(x_i) + \epsilon_i
\end{equation}

where $y_i \in \mathbb{R}$ and $x_i \in \mathbb{R}^d$. Then, a Gaussian process, $f$, is defined by
\begin{equation}
f \sim \mathrm{GP}(m(x_i), \kappa(x_i,x_j)).
\end{equation}

Say we have $n$ datapoints $(y_1,x_1), (y_2,x_2), (y_3,x_3),\ldots, (y_n,x_n)$. The kernel is often described as a measure of similarity between points. It is computed from the input values, $x_i$. In many cases, $x$ refers to time or simply an index. Say our dataset is the price of gold over a 24-hour period, sampled every minute. We index the dataset such that $x$ runs from 1 to 1440.

I don't understand why computing, for example, $\kappa(x_1,x_2)$ gives a similarity measure between the price of gold at these two data points – doesn't it ignore the actual data, $y_1$ and $y_2$? Isn't the indexing arbitrary too? We could start it at $x=100$?

Best Answer

Remember that $m(x_i)$ and $\kappa(x_i,x_j)$ are formally defined as: $$m(x_i) = \mathbb E[f(x_i)],\quad \kappa(x_i,x_j)=\mathbb E[(f(x_i)-m(x_i))(f(x_j)-m(x_j))]$$ So, as we can see from the definition, $\kappa(x_i,x_j)$ is equal to $\mathrm{Cov}[f(x_i),f(x_j)]$. It is thus a measure of how strongly the value of $f$ at $x=x_i$ covaries with its value at $x = x_j$, as intended.

You might be confused by the fact that $y_i$ and $y_j$ seemingly don't appear in the expression of the covariance function. First notice that, for $i \neq j$ and under the usual assumption that the noise terms are independent of $f$ and of one another, $\mathrm{Cov}[f(x_i),f(x_j)] = \mathrm{Cov}[y_i,y_j]$ simply because $$\begin{aligned}\mathrm{Cov}[y_i,y_j] &= \mathrm{Cov}[f(x_i)+\epsilon_i,f(x_j)+\epsilon_j]\\ &=\mathrm{Cov}[f(x_i),f(x_j)] + \mathrm{Cov}[f(x_i),\epsilon_j]\\ &\quad+\mathrm{Cov}[\epsilon_i, f(x_j)] + \mathrm{Cov}[\epsilon_i, \epsilon_j] \\ &=\mathrm{Cov}[f(x_i),f(x_j)] \end{aligned} $$ since the last three terms vanish. (For $i = j$ the noise variance is added on top: $\mathrm{Var}[y_i] = \kappa(x_i,x_i) + \mathrm{Var}[\epsilon_i]$.) On top of that, since $\kappa(x_i,x_j)$ is defined through the random variables $f(x_i)$ and $f(x_j)$ and their means $m(x_i)$ and $m(x_j)$, it is entirely determined by $x_i$ and $x_j$, so there is no need for redundancy in the notation.

Now, here is the thing: we don't know $f$ in practice (nor $m$), so $\kappa$ cannot be computed from its definition. To make up for that lack of information, we have to make some assumptions and choose a covariance function that makes sense for our use case. An assumption often made in practice is that the covariance between $f(x_i)$ and $f(x_j)$ only depends on the distance between $x_i$ and $x_j$: $\kappa(x_i,x_j) = \rho(\|x_i-x_j\|)$ (we say that $\kappa$ is radial, or isotropic; such a kernel is in particular stationary, meaning it depends only on $x_i - x_j$).
In your example with gold prices, if $x_i$ and $x_j$ are points on a map, we could naively suppose that the prices will tend to be close when $x_i$ and $x_j$ are close, since the amount of supply, demand, cost of living, etc. will also likely be similar (this is just an example, I'm not claiming this is realistic), in which case we would choose a radial kernel.
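
To make the "radial" assumption concrete, here is a minimal sketch using the squared-exponential kernel, one common choice of $\rho$. The function name `rbf_kernel` and the parameters `length_scale` and `variance` are illustrative, not something fixed by the question:

```python
import math

def rbf_kernel(x_i, x_j, length_scale=1.0, variance=1.0):
    """Squared-exponential kernel (an assumed choice of rho).

    It sees the inputs only through the distance ||x_i - x_j||,
    never through the observed values y_i, y_j.
    """
    dist_sq = sum((a - b) ** 2 for a, b in zip(x_i, x_j))
    return variance * math.exp(-0.5 * dist_sq / length_scale ** 2)

# Two nearby points on the "map" versus two distant ones.
near = rbf_kernel([0.0, 0.0], [0.1, 0.0])
far = rbf_kernel([0.0, 0.0], [3.0, 0.0])
print(near, far)  # the nearby pair gets the larger prior covariance
```

The kernel encodes a prior belief ("close inputs give correlated outputs"); whether that belief suits the data is what the choice of kernel and its parameters has to settle.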

Regarding the values $y_i$, they are first used for choosing the prior mean $m$, as well as the kernel $\kappa$ and its parameters. Afterwards, they are used to predict the values of $f$ at new unobserved points according to the update rule given here.
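
To show where the $y_i$ enter, here is a rough sketch of the standard zero-prior-mean GP regression posterior mean, $k_*^\top (K + \sigma^2 I)^{-1} y$. I am assuming this is the kind of update rule meant above. The helper names (`rbf`, `solve`, `posterior_mean`), the noise level `noise_var`, and the data are all made up for illustration:

```python
import math

def rbf(s, t, length_scale=1.0):
    """Squared-exponential kernel on scalar inputs (assumed kernel choice)."""
    return math.exp(-0.5 * (s - t) ** 2 / length_scale ** 2)

def solve(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        tail = sum(M[r][c] * z[c] for c in range(r + 1, n))
        z[r] = (M[r][n] - tail) / M[r][r]
    return z

def posterior_mean(x_train, y_train, x_new, noise_var=1e-2):
    """Zero-prior-mean GP posterior mean: k_*^T (K + noise_var * I)^{-1} y."""
    n = len(x_train)
    K_noisy = [[rbf(x_train[i], x_train[j]) + (noise_var if i == j else 0.0)
                for j in range(n)] for i in range(n)]
    alpha = solve(K_noisy, y_train)               # this is where y is used
    k_star = [rbf(x_new, x) for x in x_train]     # kernel uses inputs only
    return sum(k * a for k, a in zip(k_star, alpha))

x_train = [1.0, 2.0, 3.0]
y_train = [0.5, 1.0, 0.2]  # made-up observations

pred_near = posterior_mean(x_train, y_train, 2.0)   # at an observed input
pred_far = posterior_mean(x_train, y_train, 50.0)   # far from all data
print(pred_near, pred_far)
```

The kernel decides how much each observation should influence the prediction at a new point; the observations $y_i$ then supply the actual values being weighted. Far from all data, the prediction falls back to the prior mean (zero here).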

In short: if the kernel is well chosen, the covariance kernel $\kappa$ will represent the similarity between two points as intended and allow for a good regression model. The choice of the kernel is, however, a crucial and delicate question; you can see some interesting discussion about it here, here or here.
(About your second point: yes, the indexing $x_i$ is arbitrary. With a stationary kernel, shifting the index, say starting at $x=100$, leaves every difference $x_i - x_j$ and hence every value of $\kappa$ unchanged, so it will not impact the model.)
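
A quick way to check the indexing point, as a sketch: build the kernel matrix for a few minutes indexed from 1 and again indexed from 100. The kernels and the length scale below are arbitrary illustrative choices; the linear kernel is included only as a contrast, to show that the invariance comes from stationarity and is not automatic for every kernel:

```python
import math

def rbf_1d(s, t, length_scale=30.0):
    """Stationary kernel: depends on s and t only through s - t."""
    return math.exp(-0.5 * (s - t) ** 2 / length_scale ** 2)

def linear(s, t):
    """Non-stationary kernel, for contrast: depends on absolute positions."""
    return s * t

def gram(xs, kernel):
    """Kernel (Gram) matrix for a list of scalar inputs."""
    return [[kernel(a, b) for b in xs] for a in xs]

def max_abs_diff(A, B):
    """Largest entrywise difference between two equally sized matrices."""
    return max(abs(a - b)
               for row_a, row_b in zip(A, B)
               for a, b in zip(row_a, row_b))

minutes = list(range(1, 11))          # first ten minutes, indexed from 1
shifted = [x + 99 for x in minutes]   # the same minutes, indexed from 100

rbf_diff = max_abs_diff(gram(minutes, rbf_1d), gram(shifted, rbf_1d))
linear_diff = max_abs_diff(gram(minutes, linear), gram(shifted, linear))
print(rbf_diff, linear_diff)
```

The stationary kernel matrix is identical under the shift, so the resulting model is too, whereas the linear kernel matrix changes. Rescaling the index (minutes to hours, say) is a different matter: it would have to be matched by a corresponding change in the kernel's length scale.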
