Solved – Bayesian linear regression (predictive distribution)

Tags: bayesian, gaussian-process, regression

I'm studying a book about Gaussian processes in machine learning, and I do not know exactly how to compute the predictive distribution. This question: Understanding the predictive distribution in Gaussian linear regression says it is like $(1)$, but I found in Bishop's Pattern Recognition and Machine Learning that the predictive distribution is like $(2)$:
\begin{align}
f_*\mid x_*,\, X,\, y\quad &\to\quad \mathcal{N}(\sigma_n^{-2}x^T_* A^{-1}Xy,\; \hspace{13mm} x_*^T A^{-1}x_*) \tag{1} \\
f_*\mid x_*,\, X,\, y\quad &\to\quad \mathcal{N}(\sigma_n^{-2}x^T_* A^{-1}Xy,\; \sigma_n^2 I + x_*^T A^{-1}x_*) \tag{2}
\end{align}

(That is, they have different variances.) Why are they not equal?

Best Answer

Short answer

Equations (1) and (2) are different because they give the posterior predictive distribution for different quantities. (1) is for the noiseless linear function output, whereas (2) is for the noisy observed output.

Long answer

Recall that the model is:

$$y_i = w^T x_i + \epsilon_i, \qquad \epsilon_i \sim \mathcal{N}(0, \sigma^2_n)$$

Each observed output $y_i$ is given by a linear function of the input $x_i$ plus i.i.d. Gaussian noise. The equations listed in the question assume a Gaussian prior on the coefficients $w$, and treat the noise variance $\sigma^2_n$ as a fixed parameter.
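
Concretely, in the notation of Rasmussen & Williams, the prior is

$$w \sim \mathcal{N}(0, \Sigma_p)$$

for some prior covariance $\Sigma_p$; it is this prior, combined with the Gaussian likelihood, that gives rise to the matrix $A$ appearing in (1) and (2) (see the sketch further below).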

Notice that we could also write:

$$y_i = f_i + \epsilon_i$$

where $f_i = w^T x_i$ is the noiseless output of the linear function. This is a latent variable, since it's not directly observed. Rather, we observe the noisy output $y_i$.

Now, suppose we've fit the model to training data $(X,y)$ and want to predict the output for a new input $x_*$. The posterior of the noiseless function output $f_*$ is the Gaussian distribution in equation (1). The derivation is described in chapter 2 of Gaussian Processes for Machine Learning (Rasmussen & Williams, 2006), and sketched just below.
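
Briefly, in the book's notation: the posterior over the weights is Gaussian,

$$w \mid X, y \sim \mathcal{N}\!\left(\sigma_n^{-2} A^{-1} X y,\; A^{-1}\right), \qquad A = \sigma_n^{-2} X X^T + \Sigma_p^{-1},$$

and since $f_* = x_*^T w$ is a linear function of $w$, it is also Gaussian, with mean $\sigma_n^{-2} x_*^T A^{-1} X y$ and variance $x_*^T A^{-1} x_*$, which is exactly (1).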

The posterior of the noisy observed output $y_*$ is the Gaussian distribution in equation (2) (though there may be a typo: the variable should be called $y_*$, not $f_*$). Notice that (2) is identical to (1), except that $\sigma^2_n I$ has been added to the covariance matrix. This follows from the fact that the noisy observed outputs are produced by adding independent Gaussian noise (with mean zero and variance $\sigma^2_n$) to the noiseless function outputs.
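
The relationship is easy to check numerically. Here is a minimal NumPy sketch (all sizes and values below are made up for illustration) that computes the posterior over $w$ and then the two predictive variances for a single test input; for a single input the covariance matrices in (1) and (2) are scalars, so $\sigma^2_n I$ reduces to $\sigma^2_n$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy setup: d input dimensions, n training points.
d, n = 3, 50
sigma_n = 0.5                      # noise standard deviation
Sigma_p = np.eye(d)                # prior covariance of the weights w

X = rng.normal(size=(d, n))        # training inputs, one column per point
w_true = rng.normal(size=d)
y = X.T @ w_true + sigma_n * rng.normal(size=n)   # noisy training targets

# Posterior over w: A = sigma_n^{-2} X X^T + Sigma_p^{-1},
# mean = sigma_n^{-2} A^{-1} X y, covariance = A^{-1}.
A = X @ X.T / sigma_n**2 + np.linalg.inv(Sigma_p)
A_inv = np.linalg.inv(A)
w_mean = A_inv @ X @ y / sigma_n**2

x_star = rng.normal(size=d)        # a new test input

# (1): predictive distribution of the noiseless output f_* = w^T x_*
f_mean = x_star @ w_mean
f_var = x_star @ A_inv @ x_star

# (2): predictive distribution of the noisy observation y_* = f_* + eps
y_mean = f_mean                    # same mean
y_var = f_var + sigma_n**2         # variance grows by the noise variance

print(f"E[f_*] = E[y_*] = {f_mean:.4f}")
print(f"Var[f_*] = {f_var:.4f},  Var[y_*] = {y_var:.4f}")
print(f"difference = {y_var - f_var:.4f}  (= sigma_n^2 = {sigma_n**2:.4f})")
```

The printed difference between the two variances equals $\sigma^2_n$ exactly, while the means coincide, mirroring the difference between (1) and (2).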
