Explanation of marginal likelihood in Gaussian process

bayesian, machine-learning, normal-distribution, probability-distributions, regression

I am new to GP/non-parametric regression. I am reading Rasmussen's book on Gaussian processes. In Eqs. 2.28 and 2.29 (page 19) and in the subsequent passage, he writes the marginal likelihood as the integral of the likelihood times the prior.

$$
p(\textbf{y}|X) = \int p(\textbf{y}|\textbf{f},X)p(\textbf{f}|X)\,d\textbf{f}
$$

I understand that the prior is the GP prior (the model). The $p(\textbf{f}|X)$ part is clear to me: we assume it to be a multivariate Gaussian, $\mathcal{N}(0,\textbf{K})$.
But the book says immediately after this that the likelihood is a factorized Gaussian of the form $\textbf{y}|\textbf{f} \sim \mathcal{N}(\textbf{f},\sigma_{n}^{2}\textbf{I})$. I don't see how we can make this jump directly. How does this come about?

I know the regression model looks like $y = f(x) + \epsilon_{n}$. The model for $f(x)$ is our assumption: the same GP prior as discussed above, $\mathcal{N}(0,\textbf{K})$. The noise is i.i.d. Gaussian, $\mathcal{N}(0,\sigma_{n}^{2})$. So the targets $\textbf{y}$ will be distributed as $$\mathcal{N}(0,\textbf{K} + \sigma_{n}^{2}\textbf{I}).$$ The way I understand this is that $\textbf{y}$ is the sum of two independent Gaussian random variables, so we can use the property that the sum of two independent normals is again normal.
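As a quick numerical sanity check of this "sum of two independent Gaussians" step, here is a minimal NumPy sketch (my own, not from the book; the RBF kernel, lengthscale, and noise level are arbitrary choices):

```python
# Sketch (assumed RBF kernel, arbitrary inputs and noise level):
# sample f ~ N(0, K) and independent noise eps ~ N(0, sigma_n^2 I), then check
# that the empirical covariance of y = f + eps is close to K + sigma_n^2 I.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 5)                                       # 5 arbitrary inputs
K = np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2 / 0.2 ** 2)   # assumed RBF kernel
sigma_n = 0.3
N = 200_000

L = np.linalg.cholesky(K + 1e-10 * np.eye(len(x)))             # jitter for stability
f = rng.standard_normal((N, len(x))) @ L.T                     # rows are draws of f ~ N(0, K)
y = f + sigma_n * rng.standard_normal((N, len(x)))             # y = f + eps

# prints a small number (Monte Carlo error), confirming Cov(y) ~ K + sigma_n^2 I
print(np.abs(np.cov(y, rowvar=False) - (K + sigma_n**2 * np.eye(len(x)))).max())
```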

But I don't get the "$\textbf{y}|\textbf{f} \sim \mathcal{N}(\textbf{f},\sigma_{n}^{2}\textbf{I})$" part from this discussion. Could anyone please enlighten me as to how we can directly assume this? I am sorry if this is a silly question, but I am really new to this.

Thanks in advance.

The link to the book: http://www.gaussianprocess.org/gpml/

Best Answer

The key is the assumption of additive independent identically distributed Gaussian noise $\epsilon_n$, i.e. the assumption that the observations are given by $\textbf{y} = \textbf{f} + \epsilon_{n}$, where $\epsilon_n \sim \mathcal{N}(0,\sigma_{n}^{2}I)$ is independent of $\textbf{f}$. It should be intuitively clear that if you know the noise-free value $\textbf{f}$, then you should expect the observation $\textbf{y}$ to be Gaussian, centered on $\textbf{f}$ and with the covariance matrix of the noise.
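Here is a tiny sketch of that intuition (my own illustration, with arbitrary numbers): fix a "known" $\textbf{f}$, add i.i.d. Gaussian noise many times, and the observations scatter around $\textbf{f}$ with covariance $\sigma_n^2 I$.

```python
# Fix a particular noise-free vector f, add i.i.d. Gaussian noise many times,
# and check that y scatters around f with covariance sigma_n^2 * I,
# i.e. y | f ~ N(f, sigma_n^2 I).
import numpy as np

rng = np.random.default_rng(1)
f = np.array([0.5, -1.2, 2.0])        # an arbitrary fixed latent vector
sigma_n = 0.3
y = f + sigma_n * rng.standard_normal((100_000, f.size))

print(np.allclose(y.mean(axis=0), f, atol=0.01))                                     # mean ~ f
print(np.allclose(np.cov(y, rowvar=False), sigma_n**2 * np.eye(f.size), atol=0.01))  # cov ~ sigma_n^2 I
```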

We can show this more rigorously by deriving the conditional distribution $\textbf{y}|\textbf{f}$. First, note the distributions of $\textbf{f}$ and $\epsilon_n$:

$$ \textbf{f} \sim \mathcal{N}(0, K) \\ \epsilon_n \sim \mathcal{N}(0, \sigma_n^2 I). $$

We need to know the joint distribution of $\textbf{f}$ and $\textbf{y}$, so we will calculate the covariance matrix of $\textbf{y}$ and the cross-covariance of $\textbf{y}$ with $\textbf{f}$. Since $\textbf{y} = \textbf{f} + \epsilon_n$ we have

$$ \textbf{y} \sim \mathcal{N}(0, K + \sigma_n^2 I) $$

by the properties of independent Gaussian distributions. Also,

$$ \mathrm{Cov}(\textbf{y}, \textbf{f}) = \mathrm{Cov}(\textbf{f}, \textbf{f}) + \mathrm{Cov}(\epsilon_n, \textbf{f}) = K + 0 = K. $$

Collect the results above into a statement of the joint distribution of $\textbf{y}$ and $\textbf{f}$

$$ \begin{bmatrix} \textbf{y} \\ \textbf{f} \end{bmatrix} \sim \mathcal{N}\left(\begin{bmatrix} 0 \\ 0 \end{bmatrix}, \begin{bmatrix} K + \sigma_n^2 I & K \\ K & K \end{bmatrix} \right) $$

where in the bottom-left block we used the fact that $\mathrm{Cov}(\textbf{f}, \textbf{y}) = \mathrm{Cov}(\textbf{y}, \textbf{f})^{T} = K^T = K$.
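If it helps, here is a rough empirical check of this block structure (a sketch under an assumed RBF kernel and arbitrary noise level, not code from the book):

```python
# Stack samples [y, f] and compare the sample covariance of the stacked vector
# to the block matrix [[K + sigma_n^2 I, K], [K, K]].
import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(0, 1, 4)
K = np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2 / 0.3 ** 2)   # assumed RBF kernel
sigma_n, n, N = 0.2, len(x), 300_000

L = np.linalg.cholesky(K + 1e-10 * np.eye(n))                  # jitter for stability
f = rng.standard_normal((N, n)) @ L.T                          # f ~ N(0, K)
y = f + sigma_n * rng.standard_normal((N, n))                  # y = f + eps

target = np.block([[K + sigma_n**2 * np.eye(n), K], [K, K]])
print(np.abs(np.cov(np.hstack([y, f]), rowvar=False) - target).max())  # small (MC error)
```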

Finally, we find the conditional distribution $\textbf{y}|\textbf{f}$ using the Gaussian conditioning identity (A.6) on page 200, in Appendix A.2 of Rasmussen:

$$ \textbf{y} \,|\, \textbf{f} \sim \mathcal{N}\left(0 + K K^{-1}(\textbf{f} - 0),\ (K + \sigma_n^2 I) - K K^{-1} K\right) \\ \textbf{y} \,|\, \textbf{f} \sim \mathcal{N}(\textbf{f}, \sigma_n^2 I) $$

as expected.
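For completeness, a direct linear-algebra check of this last step (again a sketch with an arbitrary kernel and $\textbf{f}$): plugging the blocks into (A.6) gives conditional mean $K K^{-1}\textbf{f} = \textbf{f}$ and covariance $(K + \sigma_n^2 I) - K K^{-1} K = \sigma_n^2 I$.

```python
# Plug the joint covariance blocks into the Gaussian conditioning identity and
# verify numerically that the conditional mean is f and the conditional
# covariance is sigma_n^2 * I.
import numpy as np

rng = np.random.default_rng(3)
x = np.linspace(0, 1, 4)
K = np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2 / 0.3 ** 2)   # assumed RBF kernel
sigma_n, n = 0.2, len(x)
f = rng.standard_normal(n)                                     # any latent vector

cond_mean = K @ np.linalg.solve(K, f)                          # K K^{-1} f
cond_cov = (K + sigma_n**2 * np.eye(n)) - K @ np.linalg.solve(K, K)

print(np.allclose(cond_mean, f))                               # True: mean is f
print(np.allclose(cond_cov, sigma_n**2 * np.eye(n)))           # True: cov is sigma_n^2 I
```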
