Comparing Negative Log Marginal Likelihood in Gaussian Processes

gaussian-process, machine-learning

I have a given data set $\mathcal{X}$ with features $X$ and targets $Y$ that I learn with a Gaussian process regression, kernelized by $k$. Now I produce a new set of features $X^{\prime}$ (which might, for example, be a subset of $X$) that describes $\mathcal{X}$ and learn it with $k$. In a third case, I might change $k$ to $k^{\prime}$ and learn on the original features $X$. To compare the results, I can evaluate a loss function for each model and take the kernel and featurization that yield the lowest loss. But how does this relate to the negative log marginal likelihood (NLML) of the Gaussian process? I imagine the result with the lowest loss needn't be the one with the lowest NLML (and vice versa), right?

Is it even possible to compare the NLMLs of different models with each other?

(e.g. here they obtain different LML, but in what way are they comparable?)

Best Answer

The log marginal likelihood has three terms:

$\log p(y \mid X, \theta) = -\frac{1}{2} y^T K_y^{-1} y - \frac{1}{2} \log|K_y| - \frac{n}{2} \log 2\pi$

The first is a data-fit term that penalizes wrong predictions, the second penalizes model complexity, and the third is a normalization constant. How this compares to your loss function depends on what that loss is. For example, squared error only penalizes wrong predictions without accounting for model complexity.
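As a concrete illustration, here is a minimal NumPy sketch that evaluates the three terms directly. The RBF kernel, toy data, and noise level are made up for the example and are not from the question:

```python
import numpy as np

def rbf_kernel(X1, X2, lengthscale=1.0, variance=1.0):
    # Squared-exponential (RBF) kernel matrix between two sets of inputs.
    sq_dists = np.sum(X1**2, 1)[:, None] + np.sum(X2**2, 1)[None, :] - 2 * X1 @ X2.T
    return variance * np.exp(-0.5 * sq_dists / lengthscale**2)

def log_marginal_likelihood(X, y, lengthscale, variance, noise):
    n = len(y)
    K_y = rbf_kernel(X, X, lengthscale, variance) + noise * np.eye(n)  # K + noise * I
    L = np.linalg.cholesky(K_y)                          # stable inverse via Cholesky
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))  # K_y^{-1} y
    data_fit   = -0.5 * y @ alpha                  # -1/2 y^T K_y^{-1} y: penalizes wrong predictions
    complexity = -np.sum(np.log(np.diag(L)))       # -1/2 log|K_y|: penalizes model complexity
    norm_const = -0.5 * n * np.log(2 * np.pi)      # normalization constant
    return data_fit + complexity + norm_const

# Toy data, purely illustrative.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(20, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(20)
print(log_marginal_likelihood(X, y, lengthscale=1.0, variance=1.0, noise=0.01))
```

Negating the returned value gives the NLML for that particular choice of hyperparameters.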

The marginal likelihood is the probability of getting your observations from the functions in your GP prior (which is defined by the kernel). When you minimize the negative log marginal likelihood over $\theta$ for a given family of kernels (for example, RBF, Matern, or cubic), you're comparing all the kernels of that family (as defined by their hyperparameters $\theta$) and choosing the most likely kernel. Importantly, changing $\theta$ means you now have a different kernel and therefore a different model.
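In practice this minimization over $\theta$ is usually delegated to the GP library. A minimal sketch with scikit-learn (the toy data is made up for the example; `fit` maximizes the log marginal likelihood over the kernel hyperparameters, here with several optimizer restarts):

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

# Toy data, purely illustrative.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(30, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(30)

# The optimizer searches over the RBF family, i.e. over signal variance,
# length scale, and noise level.
kernel = 1.0 * RBF(length_scale=1.0) + WhiteKernel(noise_level=0.1)
gp = GaussianProcessRegressor(kernel=kernel, n_restarts_optimizer=5).fit(X, y)

print(gp.kernel_)                          # the fitted kernel, i.e. \hat{theta}
print(-gp.log_marginal_likelihood_value_)  # the NLML at \hat{theta}
```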

By extension then, you can also compare the NLML for instances (defined by their $\theta$) of different families of kernels. For example, you do the minimization over RBF kernels and find that the NLML is minimized at hyperparameters $\hat{\theta}_1$ with a value of $a$, and then minimize over cubic kernels and find that the NLML is minimized at hyperparameters $\hat{\theta}_2$ with a value of $b$. In this case, comparing $a$ and $b$ will tell you whether a GP with an RBF kernel with hyperparameters $\hat{\theta}_1$ or a GP with a cubic kernel with hyperparameters $\hat{\theta}_2$ is more likely.
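A hedged sketch of that comparison, again with scikit-learn; Matern stands in for the cubic kernel here simply because it ships with the library, and the toy data is the same made-up example as above:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, Matern, WhiteKernel

# Toy data, purely illustrative.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(30, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(30)

results = {}
for name, base_kernel in [("RBF", RBF(length_scale=1.0)),
                          ("Matern", Matern(length_scale=1.0, nu=1.5))]:
    kernel = 1.0 * base_kernel + WhiteKernel(noise_level=0.1)
    gp = GaussianProcessRegressor(kernel=kernel, n_restarts_optimizer=5).fit(X, y)
    results[name] = -gp.log_marginal_likelihood_value_  # NLML at \hat{theta} for this family

# The smaller NLML identifies the more likely of the two *fitted* kernels,
# not of the RBF / Matern families as a whole.
print(results)
```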

What this won't tell you is whether a general RBF or cubic kernel is more likely, only which of the two specific kernels you found is more likely. If you want to know whether a general RBF kernel or a general cubic kernel is more likely, then you need to move up another level and marginalize over all the hyperparameters for each family of kernels, and then compare those probabilities. I'm not sure there's a closed-form expression for that, but you could estimate it using Markov chain Monte Carlo.
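For reference, the higher-level quantity being compared there is the evidence of a kernel family $\mathcal{M}$ under some hyperparameter prior $p(\theta \mid \mathcal{M})$ (the prior itself is a choice you have to make):

$p(y \mid X, \mathcal{M}) = \int p(y \mid X, \theta)\, p(\theta \mid \mathcal{M})\, d\theta$

Comparing this integral for the RBF and cubic families answers the "which family" question; in general it has no closed form, which is where MCMC or another numerical estimate comes in.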
