Solved – In Gaussian processes, why does the conditional Gaussian “agree” with data

Tags: conditional probability, gaussian process, normal distribution

I'm learning about GPs, and one thing I don't quite understand is how the posterior works. Consider this figure:

[Figure 2.2 from Rasmussen and Williams: functions drawn from the GP prior and from the posterior conditioned on the observations.]

Rasmussen and Williams say:

Graphically in Figure 2.2 you may think of generating functions from the prior, and rejecting the ones that disagree with the observations… Fortunately, in probabilistic terms this operation is extremely simple, corresponding to conditioning the joint Gaussian prior distribution on the observations.

To formalize a bit, given this joint distribution,

$$
\begin{bmatrix}
\mathbf{f}_* \\ \mathbf{f}
\end{bmatrix}
\sim
\mathcal{N} \Bigg(
\begin{bmatrix}
\mathbf{0} \\ \mathbf{0}
\end{bmatrix},
\begin{bmatrix}
K(X_*, X_*) & K(X_*, X)
\\
K(X, X_*) & K(X, X)
\end{bmatrix}
\Bigg)
$$

the conditional distribution is

$$
\begin{align}
\mathbf{f}_{*} \mid \mathbf{f}
\sim
\mathcal{N}(&K(X_*, X) K(X, X)^{-1} \mathbf{f},\\
&K(X_*, X_*) - K(X_*, X) K(X, X)^{-1} K(X, X_*))
\end{align}
$$
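
To make this concrete, here is a minimal NumPy sketch of that conditioning step. The squared-exponential kernel and the toy observations are my own assumptions, not anything specified above:

```python
import numpy as np

def rbf_kernel(A, B, length_scale=1.0):
    """Squared-exponential kernel k(a, b) = exp(-|a - b|^2 / (2 l^2)) for 1-D inputs."""
    sq_dists = (A[:, None] - B[None, :]) ** 2
    return np.exp(-0.5 * sq_dists / length_scale**2)

X = np.array([-4.0, -1.0, 2.0])      # observed inputs (toy values)
f = np.array([-1.0, 0.5, 1.0])       # observed (noise-free) function values
X_star = np.linspace(-5, 5, 200)     # test inputs

K_xx = rbf_kernel(X, X)              # K(X, X)
K_sx = rbf_kernel(X_star, X)         # K(X_*, X)
K_ss = rbf_kernel(X_star, X_star)    # K(X_*, X_*)

# Conditional (posterior) mean and covariance, exactly as in the formula above.
K_xx_inv = np.linalg.inv(K_xx)       # fine for a toy example; use a Cholesky solve in practice
mean = K_sx @ K_xx_inv @ f
cov = K_ss - K_sx @ K_xx_inv @ K_sx.T
```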

What I don't understand is how samples from this conditional distribution can always "agree" with the observations. Aren't the samples $\mathbf{f}_*$ still instances of Gaussian random variables?

Best Answer

To be slightly more explicit about what I think your question is:

Yes, samples from the posterior are Gaussian everywhere, including exactly at the previously observed points. But in this "noise-free" setting, the variance at those points is 0, and a Gaussian with variance 0 is always exactly equal to its mean.

It's easiest to see this in the case where we condition on only one point, $X = X_*$, in which case the conditional variance becomes $$K(X, X) - K(X, X) K(X, X)^{-1} K(X, X) = 0,$$ and the conditional mean is $$K(X, X) K(X, X)^{-1} \mathbf{f} = \mathbf{f}.$$
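
You can also check this numerically. Here is a small sketch (again assuming a squared-exponential kernel and made-up observations): if the test inputs include the observed inputs, the posterior variance there is zero up to round-off, so every posterior sample reproduces $\mathbf{f}$ at those points.

```python
import numpy as np

def rbf_kernel(A, B, length_scale=1.0):
    sq_dists = (A[:, None] - B[None, :]) ** 2
    return np.exp(-0.5 * sq_dists / length_scale**2)

X = np.array([-4.0, -1.0, 2.0])                        # observed inputs (toy values)
f = np.array([-1.0, 0.5, 1.0])                         # observed (noise-free) values
X_star = np.concatenate([X, np.linspace(-5, 5, 50)])   # test inputs include the observed inputs

K_xx = rbf_kernel(X, X)
K_sx = rbf_kernel(X_star, X)
K_ss = rbf_kernel(X_star, X_star)

K_xx_inv = np.linalg.inv(K_xx)
mean = K_sx @ K_xx_inv @ f
cov = K_ss - K_sx @ K_xx_inv @ K_sx.T

print(np.diag(cov)[:3])   # posterior variance at the observed inputs: ~0 (round-off only)

# Draw posterior samples; a tiny jitter keeps the covariance numerically PSD.
samples = np.random.multivariate_normal(mean, cov + 1e-10 * np.eye(len(mean)), size=5)
print(np.max(np.abs(samples[:, :3] - f)))   # ~0 (up to the jitter): every sample hits the data
```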
