Solved – Trying to understand Gaussian Process

gaussian process, machine learning

I'm reading the GPML book, and Chapter 2 (page 15) explains how to do regression with a Gaussian Process (GP), but I'm having a hard time figuring out how it works.

In Bayesian inference for parametric models, we first choose a prior on the model parameters $\theta$, that is, $p(\theta)$; second, given the training data $D$, we compute the likelihood $p(D|\theta)$; and finally we obtain the posterior of $\theta$, $p(\theta|D)$, which is used in the predictive distribution $$p(y^*|x^*,D)=\int p(y^*|x^*,\theta)\,p(\theta|D)\,d\theta,$$ and the above is what we do in Bayesian inference for parametric models, right?
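To spell out the step I'm taking for granted, the posterior here comes from Bayes' rule:
$$p(\theta|D)=\frac{p(D|\theta)\,p(\theta)}{p(D)}\;\propto\; p(D|\theta)\,p(\theta).$$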

Well, as the book says, a GP is non-parametric. As far as I understand it, after specifying the mean function $m(x)$ and the covariance function $k(x,x')$, we have a GP prior over the function $f$, $$f \sim GP(m,k).$$ Now I have a noise-free training data set $$D=\{(x_1,f_1),\dots,(x_n,f_n)\},$$ and I thought I should compute the likelihood $p(D|f)$, then the posterior $p(f|D)$, and finally use the posterior to make predictions.
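To make sure I understand what a prior over functions means, here is a minimal sketch of how I picture sampling from it on a finite grid of inputs (the squared-exponential kernel and the zero mean function are just my assumptions for illustration, not anything the book prescribes):

```python
import numpy as np

def rbf_kernel(X1, X2, lengthscale=1.0, variance=1.0):
    """Squared-exponential covariance; any valid k(x, x') would do here."""
    sqdist = (X1[:, None] - X2[None, :]) ** 2
    return variance * np.exp(-0.5 * sqdist / lengthscale ** 2)

# A GP prior f ~ GP(m, k) means: on any finite grid of inputs, the vector of
# function values is jointly Gaussian with mean m(x_i) and covariance k(x_i, x_j).
x = np.linspace(-3.0, 3.0, 100)
m = np.zeros_like(x)                             # zero mean function for simplicity
K = rbf_kernel(x, x) + 1e-8 * np.eye(len(x))     # small jitter for numerical stability

# Each row is one draw of "a function" from the prior, evaluated on the grid.
prior_samples = np.random.multivariate_normal(m, K, size=3)
```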

HOWEVER, that's not what the book does! I mean, after specifying the prior $p(f)$, it doesn't compute the likelihood and posterior; it goes straight to the predictive distribution.

Question:

1) Why not compute the likelihood and posterior? Is it just because a GP is non-parametric, so we don't do that?

2) In the book (pages 15-16), the predictive distribution is derived from the joint distribution of the training outputs $\mathbf f$ and the test outputs $\mathbf f^*$, which is termed the joint prior. This confuses me badly: why join them together? (I have written out the joint prior I mean below, after question 3, for reference.)

3) I have seen some articles call $f$ the latent variable. Why?
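For concreteness, the joint prior I am referring to is the one the book writes down for the noise-free case (assuming a zero mean function):
$$\begin{bmatrix} \mathbf f \\ \mathbf f^* \end{bmatrix} \sim \mathcal N\!\left(\mathbf 0,\ \begin{bmatrix} K(X,X) & K(X,X^*) \\ K(X^*,X) & K(X^*,X^*) \end{bmatrix}\right).$$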

Best Answer

and the above is what we do in Bayesian inference for parametric models, right?

The book is using Bayesian model averaging, which works the same way for parametric models or any other Bayesian method, given that you have a posterior over your parameters.

Now I have a noise-free training data set

It doesn't need to be 'noise-free'. See later pages.

HOWEVER, that's not what the book does! I mean, after specifying the prior p(f), it doesn't compute the likelihood and posterior; it goes straight to the predictive distribution.

See this: https://people.cs.umass.edu/~wallach/talks/gp_intro.pdf

I believe on page 17 we have the prior, and later the likelihood. If you write out the derivations, find the posterior, and then average over the posterior for prediction (as in the weight-space view), you will arrive at the same equations for the mean and covariance as on page 19.
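Here is a minimal sketch of that noise-free conditioning (not code from the book; the squared-exponential kernel is just an assumed choice). Conditioning the joint Gaussian prior over $(\mathbf f, \mathbf f^*)$ on the observed $\mathbf f$ yields the same mean and covariance expressions as on page 19:

```python
import numpy as np

def rbf_kernel(X1, X2, lengthscale=1.0, variance=1.0):
    """Squared-exponential covariance -- an assumed choice, not prescribed by the book."""
    sqdist = (X1[:, None] - X2[None, :]) ** 2
    return variance * np.exp(-0.5 * sqdist / lengthscale ** 2)

def gp_predict_noise_free(X_train, f_train, X_test):
    """Condition the joint Gaussian prior over (f, f*) on the observed f.

    Returns the predictive mean  K(X*,X) K(X,X)^{-1} f
    and covariance               K(X*,X*) - K(X*,X) K(X,X)^{-1} K(X,X*).
    """
    K = rbf_kernel(X_train, X_train)    # K(X, X)
    K_s = rbf_kernel(X_train, X_test)   # K(X, X*)
    K_ss = rbf_kernel(X_test, X_test)   # K(X*, X*)

    alpha = np.linalg.solve(K, f_train)   # K(X,X)^{-1} f, without an explicit inverse
    mean = K_s.T @ alpha
    v = np.linalg.solve(K, K_s)           # K(X,X)^{-1} K(X,X*)
    cov = K_ss - K_s.T @ v
    return mean, cov

# Toy usage: observe sin(x) at a few inputs, predict on a grid.
X_train = np.array([-2.0, -1.0, 0.5, 2.0])
f_train = np.sin(X_train)
X_test = np.linspace(-3.0, 3.0, 50)
mu, cov = gp_predict_noise_free(X_train, f_train, X_test)
```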
