It would be a good idea to get the optimisation code to print out the hyper-parameters each time it performs a function evaluation. Usually you can work out what is going wrong once you know what the model decides about the hyper-parameter values.
I suspect what is happening is that the model has decided that an essentially linear classifier would be best and has set $\sigma_n$ to zero, as the noise component is not seen as necessary (which is a shame, as it adds a ridge to the covariance matrix that helps ensure it is positive definite). In that case only the SEiso part is used, so $\sigma_f$ will probably be much larger than $\sigma_n$; however, to make a linear classifier it will try to make the length-scale $l$ as large as possible, which seems to end up causing numerical problems when evaluating the term inside the exponential. I'm a pretty heavy user of GPML and have seen this a fair bit.

One solution is to limit the magnitudes of the logarithms of the hyper-parameters during the search (which is equivalent to placing a hyper-prior on the hyper-parameters); this tends to prevent the problem. If you print out the values of the hyper-parameters, the last ones printed before it goes "bang" will give you a good idea of where to place the limits. Doing so tends not to affect performance very much: the generalisation error in such regions of hyper-parameter space tends to be fairly flat, which causes gradient descent methods to take large steps that put you far enough from the origin to run into numerical accuracy problems.
In short: whenever you run into numerical issues in model selection, print out the hyper-parameter values at each step.
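Since GPML is a MATLAB toolbox, here is a minimal NumPy/SciPy sketch of the same idea (not the original code): print the hyper-parameters at every function evaluation and add a soft penalty on the log hyper-parameters, which acts as a crude hyper-prior. A GP *regression* marginal likelihood with an SE kernel is used purely for illustration, and the data, limits and penalty weight are all made up.

```python
import numpy as np
from scipy.optimize import minimize

# Toy 1-D regression data standing in for the original problem.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=30)
y = np.sin(X) + 0.1 * rng.standard_normal(30)

def nlml(log_hyp):
    """Negative log marginal likelihood of a GP with an SE kernel.
    log_hyp = [log l, log sigma_f, log sigma_n]."""
    l, sf, sn = np.exp(log_hyp)
    d = X[:, None] - X[None, :]
    K = sf**2 * np.exp(-0.5 * (d / l)**2) + sn**2 * np.eye(len(X))
    K += 1e-8 * np.eye(len(X))          # small ridge for numerical stability
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return 0.5 * y @ alpha + np.log(np.diag(L)).sum() + 0.5 * len(X) * np.log(2 * np.pi)

def objective(log_hyp, limit=5.0, weight=10.0):
    # Print the hyper-parameters at every evaluation so the values just before a
    # crash are visible, and add a soft quadratic penalty (a crude hyper-prior)
    # discouraging |log hyper-parameter| > limit.
    print("log hyper-parameters:", np.round(log_hyp, 3))
    penalty = weight * np.sum(np.maximum(np.abs(log_hyp) - limit, 0.0) ** 2)
    return nlml(log_hyp) + penalty

result = minimize(objective, x0=np.zeros(3), method="L-BFGS-B")
print("optimised log hyper-parameters:", result.x)
```

The printed values play the same role as the suggested print-outs in GPML: they show where in hyper-parameter space the optimiser was heading when things blow up, which tells you where to put the limits.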
Recall the expression for the posterior variance of a Gaussian process: $\operatorname{var}_{n}\left(f\left(\mathbf{x}_{*}\right)\right)=k\left(\mathbf{x}_{*}, \mathbf{x}_{*}\right)-\mathbf{k}_{*}^{\top}\left(K_n+\sigma^{2} I_n\right)^{-1} \mathbf{k}_{*}$. We want to show that $\operatorname{var}_{n-1}(f(x_*)) \geq \operatorname{var}_n(f(x_*))$. The first term is the same for both $\operatorname{var}_n$ and $\operatorname{var}_{n-1}$, so we just need to break down the second term in a way that lets us compare it between $\operatorname{var}_n$ and $\operatorname{var}_{n-1}$.
Following the hint to look at section A.3 in the book, we see that for the partitioned matrix $A$ and its inverse $A^{-1}$,
\begin{align}
A = \begin{pmatrix} P & Q \\ R & S \end{pmatrix}, \qquad A^{-1} = \begin{pmatrix} \tilde P & \tilde Q \\ \tilde R & \tilde S \end{pmatrix},
\end{align}
\begin{align}
\tilde P & = P^{-1} + P^{-1}QMRP^{-1},\\
\tilde Q & = -P^{-1}QM,\\
\tilde R & = -MRP^{-1},\\
\tilde S & = M,\\
\text{where}\ M & = (S - RP^{-1}Q)^{-1}.
\end{align}
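As a quick sanity check (not part of the original answer), these partitioned-inverse identities can be verified numerically on a random positive-definite matrix:

```python
import numpy as np

# Numerical sanity check of the partitioned-inverse identities above,
# using a random symmetric positive-definite matrix (illustration only).
rng = np.random.default_rng(1)
n, m = 4, 2
B = rng.standard_normal((n + m, n + m))
A = B @ B.T + (n + m) * np.eye(n + m)        # well-conditioned s.p.d. matrix

P, Q = A[:n, :n], A[:n, n:]
R, S = A[n:, :n], A[n:, n:]

Pinv = np.linalg.inv(P)
M = np.linalg.inv(S - R @ Pinv @ Q)          # M = (S - R P^{-1} Q)^{-1}

Ainv = np.linalg.inv(A)
assert np.allclose(Ainv[:n, :n], Pinv + Pinv @ Q @ M @ R @ Pinv)   # tilde P
assert np.allclose(Ainv[:n, n:], -Pinv @ Q @ M)                    # tilde Q
assert np.allclose(Ainv[n:, :n], -M @ R @ Pinv)                    # tilde R
assert np.allclose(Ainv[n:, n:], M)                                # tilde S
print("partitioned-inverse identities verified")
```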
We can decompose the Gram matrix with added noise $K_n + \sigma^2I_n$ as follows:
\begin{align}
K_n + \sigma^2 I_n =
\begin{pmatrix}
K_{n-1} + \sigma^2 I_{n-1} & k_{n-1}(x')\\k_{n-1}(x')^\top & k(x',x') + \sigma^2
\end{pmatrix}
\end{align}
where $x'$ is the $n$th point sampled.
Writing out the decomposed inverse, we get
\begin{align}
(K_n + \sigma^2 I_n)^{-1} = \begin{pmatrix}\kappa + \kappa k_{n-1}(x')Mk_{n-1}(x')^\top\kappa & -\kappa k_{n-1}(x')M \\ -Mk_{n-1}(x')^\top \kappa & M \end{pmatrix},
\end{align}
with
\begin{align}
M & = (k(x', x') + \sigma^2 - k_{n-1}(x')^\top(K_{n-1} + \sigma^2I_{n-1})^{-1}k_{n-1}(x'))^{-1}, \\
\kappa & = (K_{n-1} + \sigma^2I_{n-1})^{-1}
\end{align}
Finally, we can compute the term $k_*^\top(K_n + \sigma^2I_n)^{-1}k_*$ by multiplying through:
\begin{align}
k_*^\top(K_n + \sigma^2I_n)^{-1}k_* & = \begin{pmatrix}k_{n-1}(x^*)\\k'(x^*)\end{pmatrix}^\top\begin{pmatrix}\kappa + \kappa k_{n-1}(x')Mk_{n-1}(x')^\top\kappa & -\kappa k_{n-1}(x')M \\ -Mk_{n-1}(x')^\top\kappa & M \end{pmatrix}\begin{pmatrix}k_{n-1}(x^*)\\k'(x^*)\end{pmatrix}\\
& = \begin{pmatrix} k_{n-1}^\top(x^*)\kappa + k_{n-1}^\top(x^*)\kappa k_{n-1}(x')Mk_{n-1}(x')^\top\kappa - k'(x^*)Mk_{n-1}(x')^\top \kappa \\ -k_{n-1}^\top(x^*)\kappa k_{n-1}(x')M + k'(x^*)M \end{pmatrix} ^\top
\begin{pmatrix}k_{n-1}(x^*)\\k'(x^*)\end{pmatrix}
\end{align}
where $\kappa = (K_{n-1} + \sigma^2I_{n-1})^{-1}$ and $k'(x^*) = k(x', x^*)$.
We end up with
\begin{align}
k_*^\top(K_n + \sigma^2I_n)^{-1}k_* & = k_{n-1}^\top(x^*)\kappa k_{n-1}(x^*) + k_{n-1}^\top(x^*)\kappa k_{n-1}(x')Mk_{n-1}(x')^\top\kappa k_{n-1}(x^*)\\
& - k'(x^*)Mk_{n-1}(x')^\top\kappa k_{n-1}(x^*) - k_{n-1}(x^*)^\top\kappa k_{n-1}(x')Mk'(x^*) + k'(x^*)Mk'(x^*).
\end{align}
Note that the first term is simply the corresponding term from $\text{var}_{n-1}$, so we just need to show that the sum of the remaining terms is non-negative in order to show that this term is larger for $n$ than for $n-1$ (and hence the variance is smaller). Note that $M$ is the reciprocal of the posterior variance at $x'$ given the first $n-1$ points plus $\sigma^2$, so it is positive and we can factor it out of the remaining terms.
We are left with
\begin{align}
\alpha^2 - 2k'(x^*)\alpha + k'(x^*)^2,
\end{align}
where $\alpha = k_{n-1}^\top(x^*)\kappa k_{n-1}(x')$ is a scalar. Restoring the factor of $M$ and completing the square, we have
\begin{align}
k_*^\top(K_n + \sigma^2I_n)^{-1}k_* & = k_{n-1}^\top(x^*)\kappa k_{n-1}(x^*) + M(\alpha - k'(x^*))^2\\
& = k_{n-1}(x^*)^\top(K_{n-1} + \sigma^2I_{n-1})^{-1}k_{n-1}(x^*) + M(\alpha - k'(x^*))^2\\
& \geq k_{n-1}(x^*)^\top(K_{n-1} + \sigma^2I_{n-1})^{-1}k_{n-1}(x^*).
\end{align}
So the variance after $n$ points is no larger than the variance after $n-1$ points, with equality achieved exactly when $\alpha - k'(x^*) = k_{n-1}^\top(x^*)(K_{n-1} + \sigma^2I_{n-1})^{-1} k_{n-1}(x') - k(x', x^*)$ is zero (since $M > 0$).
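For a concrete check of the result, here is a small numerical sketch (the SE kernel, noise level and data are arbitrary choices for illustration) confirming that the posterior variance at a fixed test point never increases as observations are added:

```python
import numpy as np

# Numerical check that the posterior variance at a fixed test point x_*
# never increases as observations are added (SE kernel, sigma chosen arbitrarily).
rng = np.random.default_rng(0)
sigma = 0.2

def se(A, B):
    """Squared-exponential kernel with unit length-scale and signal variance."""
    return np.exp(-0.5 * (A[:, None] - B[None, :]) ** 2)

def posterior_var(X, x_star):
    K = se(X, X) + sigma ** 2 * np.eye(len(X))
    k_star = se(X, np.array([x_star]))[:, 0]
    return se(np.array([x_star]), np.array([x_star]))[0, 0] - k_star @ np.linalg.solve(K, k_star)

X = rng.uniform(-3, 3, size=10)
x_star = 0.5
for n in range(2, len(X) + 1):
    # var_n <= var_{n-1} up to floating-point tolerance
    assert posterior_var(X[:n], x_star) <= posterior_var(X[:n - 1], x_star) + 1e-12
print("posterior variance is non-increasing at x_* =", x_star)
```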
Best Answer
The noise parameter $\sigma^2$ is the parameter of the likelihood function, a.k.a. the noise model.
The one with $+\sigma^2$ is the variance of $y$ (the observation). The one without is the variance of $f$ (the latent variable, i.e. the observation minus the noise). So they differ by $\sigma^2$, which is the same for all values of the input variable $x$.
The formulas look right to me. As you can see, the variance of $f$ (the noiseless latent variable) also depends on the noise parameter. That makes sense too: your estimate of the noise affects the uncertainty estimate (i.e. the variance) of the noiseless latent variable.
To avoid confusion, I would refer to them as $\mathrm{var}(y)$ and $\mathrm{var}(f)$.
One more thing: the two expressions you denoted by $\Sigma$ are scalars, not matrices. The covariance matrix is $K$, not $\Sigma$. Here $\Sigma$ is a variance, not a covariance matrix, since it describes a single one-dimensional variable (either $y$ or $f$).
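To make the $\mathrm{var}(y)$ versus $\mathrm{var}(f)$ distinction concrete, here is a tiny numerical sketch (toy data and an SE kernel with unit hyper-parameters, all values made up) showing the two predictive variances differ by exactly $\sigma^2$:

```python
import numpy as np

# Illustration that var(y_*) and var(f_*) differ by exactly sigma^2.
rng = np.random.default_rng(0)
sigma = 0.3
X = rng.uniform(-3, 3, size=15)

def se(A, B):
    # SE kernel with unit length-scale and signal variance
    return np.exp(-0.5 * (A[:, None] - B[None, :]) ** 2)

x_star = np.array([0.7])
K = se(X, X) + sigma ** 2 * np.eye(len(X))
k_star = se(X, x_star)[:, 0]

var_f = se(x_star, x_star)[0, 0] - k_star @ np.linalg.solve(K, k_star)  # latent f
var_y = var_f + sigma ** 2                                              # observation y
print(var_f, var_y, var_y - var_f)   # the last value equals sigma**2
```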