The log marginal likelihood for a Gaussian process, as given in Rasmussen and Williams' Gaussian Processes for Machine Learning (Eq. 2.30), is:
$$\log p(y|X) = -\frac{1}{2}y^T(K+\sigma^2_n I)^{-1}y - \frac{1}{2}\log|K+\sigma^2_n I|-\frac{n}{2}\log2\pi$$
whereas MATLAB's documentation on Gaussian process regression formulates the relation as
$$\log p(y|X, \beta, \theta, \sigma^2) = -\frac{1}{2}\left(y-H\beta\right)^T(K+\sigma^2_n I)^{-1}\left(y-H\beta\right) - \frac{1}{2}\log|K+\sigma^2_n I|-\frac{n}{2}\log2\pi$$
where $H$ is the matrix of basis function values and $\beta$ is the coefficient vector.
My doubts:
- Why is there a difference between the two relations?
- From my understanding, $H\beta$ is the prediction of the Gaussian process; am I right?
Thanks
Best Answer
The more general formulation for the log marginal likelihood (not marginal log likelihood, as you originally wrote - I edited it in your post) of a GP is
$$\log p(y|X) = -\frac{1}{2}(y - m(X))^T(K+\sigma^2_n I)^{-1}(y - m(X)) - \frac{1}{2}\log|K+\sigma^2_n I|-\frac{n}{2}\log2\pi$$
where $m(x): \mathbb{R}^d \rightarrow \mathbb{R}$ is the mean function of the GP evaluated at a point $x$, and the notation $m(X)$ denotes the vector obtained by applying the mean function to every point in $X$. The GP in GPML (Eq. 2.30) is a zero-mean GP, i.e. $m(x) \equiv 0$.
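As a concrete check, the general expression can be evaluated directly with a Cholesky factorization. The sketch below is illustrative, not from either reference; the RBF kernel, noise level, and constant mean function are all assumptions:

```python
import numpy as np

def log_marginal_likelihood(X, y, mean_fn, kernel_fn, sigma_n):
    """log p(y|X) for a GP with mean function m and kernel K (general form above)."""
    n = len(y)
    K = kernel_fn(X, X) + sigma_n**2 * np.eye(n)
    r = y - mean_fn(X)                       # centre the targets: y - m(X)
    L = np.linalg.cholesky(K)                # K + sigma_n^2 I = L L^T
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, r))
    return (-0.5 * r @ alpha
            - np.sum(np.log(np.diag(L)))     # = 0.5 * log|K + sigma_n^2 I|
            - 0.5 * n * np.log(2 * np.pi))

# Illustrative RBF kernel (unit length scale) and constant mean function
rbf = lambda A, B: np.exp(-0.5 * (A[:, None] - B[None, :])**2)
const_mean = lambda X: np.full(len(X), 2.0)

X = np.array([0.0, 0.5, 1.0])
y = np.array([2.1, 1.9, 2.2])
lml = log_marginal_likelihood(X, y, const_mean, rbf, sigma_n=0.1)
```

Setting `mean_fn` to zero recovers the GPML Eq. 2.30 form; using the centred targets `y - m(X)` with a zero mean gives the identical value, which makes the relationship between the two formulations explicit.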
In the MATLAB version, $H \beta$ stands for a mean function expressed as a linear combination of basis functions $H = H(x)$; it is not the prediction of the GP.
The GP mean prediction reverts to the mean function far away from the training points $X$ (far relative to the kernel's length scale), but in general differs from it elsewhere.
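This reversion is easy to see numerically: the posterior mean is $m(x_*) + k(x_*, X)(K+\sigma_n^2 I)^{-1}(y - m(X))$, and with an RBF kernel $k(x_*, X) \to 0$ as $x_*$ moves away from the data. A minimal sketch, with an assumed constant mean function and unit length scale:

```python
import numpy as np

# Posterior mean: m(x*) + k(x*, X) (K + sigma_n^2 I)^{-1} (y - m(X)).
# Far from the training inputs, k(x*, X) -> 0, so the prediction
# falls back to the mean function m(x*).
rbf = lambda A, B: np.exp(-0.5 * (A[:, None] - B[None, :])**2)  # length scale 1
m = lambda X: 3.0 * np.ones(len(X))        # illustrative constant mean function

X = np.array([0.0, 0.5, 1.0])
y = np.array([3.5, 2.8, 3.1])
sigma_n = 0.1
Kinv_r = np.linalg.solve(rbf(X, X) + sigma_n**2 * np.eye(len(X)), y - m(X))

def predict(x_star):
    xs = np.atleast_1d(x_star)
    return float(m(xs)[0] + rbf(xs, X) @ Kinv_r)

near = predict(0.25)    # influenced by the data, differs from 3.0
far = predict(100.0)    # essentially the mean-function value 3.0
```

At `x* = 100`, a hundred length scales from the data, the kernel vector underflows to zero and the prediction equals the prior mean; near the data the correction term dominates.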