Solved – Estimating the variance of the noise in Gaussian Process prediction

cross-validation, gaussian-process

I've been trying to use leave-one-out cross-validation to estimate $\sigma_n$, where $\sigma_n^2$ is the variance of the signal noise, when doing prediction according to

$E[f_*] = k_*^T(K+\sigma_n^2I)^{-1}y$ (GPML Equation 2.25)
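In code, Eq. 2.25 can be sketched as follows (the squared-exponential kernel, hyperparameter values, and toy data here are illustrative assumptions, not my actual setup):

```python
import numpy as np

def se_kernel(a, b, length_scale=1.0, signal_var=1.0):
    """Squared-exponential kernel matrix between 1-D inputs a and b."""
    sq_dist = (a[:, None] - b[None, :]) ** 2
    return signal_var * np.exp(-0.5 * sq_dist / length_scale**2)

def gp_predict_mean(X_train, y_train, X_test, noise_var=0.01):
    """E[f_*] = k_*^T (K + sigma_n^2 I)^{-1} y  (GPML Eq. 2.25)."""
    K = se_kernel(X_train, X_train)
    k_star = se_kernel(X_train, X_test)          # n_train x n_test
    alpha = np.linalg.solve(K + noise_var * np.eye(len(X_train)), y_train)
    return k_star.T @ alpha

# Toy data: noisy sine observations.
rng = np.random.default_rng(0)
X = np.linspace(0.0, 5.0, 20)
y = np.sin(X) + 0.1 * rng.standard_normal(20)
f_star = gp_predict_mean(X, y, np.array([2.5]))
```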

My questions are:

  1. In Chapter 5 of GPML, Rasmussen and Williams suggest using leave-one-out cross-validation (LOO-CV) to estimate hyperparameters by maximizing the LOO log predictive probability. I think they mean this for hyperparameters of the kernel function $k$; is it also appropriate for estimating $\sigma_n$?

  2. If I'm using MSE as my loss function instead of the LOO log predictive probability, then for some training sets I get an MSE that is a monotonically increasing function of $\sigma_n$. Does that just mean my training set is too small? Is there a more appropriate loss function to use?
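A minimal sketch of the kind of experiment I mean, with a made-up squared-exponential kernel and toy data (the closed-form LOO residuals $r_i = [K_y^{-1}y]_i / [K_y^{-1}]_{ii}$ are from GPML Section 5.4.2, so no refitting is needed per fold):

```python
import numpy as np

def loo_mse(K, y, noise_var):
    """LOO mean squared error via the closed-form residuals
    r_i = [K_y^{-1} y]_i / [K_y^{-1}]_{ii}, with K_y = K + noise_var * I
    (GPML Section 5.4.2)."""
    K_inv = np.linalg.inv(K + noise_var * np.eye(len(y)))
    residuals = (K_inv @ y) / np.diag(K_inv)
    return np.mean(residuals ** 2)

# Toy data and kernel (illustrative only).
rng = np.random.default_rng(1)
X = np.linspace(0.0, 5.0, 15)
y = np.sin(X) + 0.1 * rng.standard_normal(15)
K = np.exp(-0.5 * (X[:, None] - X[None, :]) ** 2)   # SE kernel, l = 1

# LOO MSE as a function of the candidate noise variance.
grid = np.logspace(-4, 1, 20)
mses = [loo_mse(K, y, v) for v in grid]
```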

Best Answer

I presume you mean the $\sigma_n$ of the squared exponential (now also called the exponentiated quadratic) kernel:

$k(x,x') = \sigma_n^2 \exp\left(-\frac{(x-x')^2}{2l^2}\right)$

In that case, the LOO log predictive probability can indeed be used to determine the hyperparameters $\sigma_n$ and $l$. This is done by finding the pair of hyperparameters that maximises the predictive probability of the unseen data (the point being left out), summed over all leave-one-out folds.
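As a sketch, the LOO log predictive probability has a closed form (GPML Eqs. 5.10–5.12), so you never have to actually refit the GP per fold; the only assumption below is that `K_y` already includes the noise term on its diagonal:

```python
import numpy as np

def loo_log_pred_prob(K_y, y):
    """Sum of LOO log predictive probabilities (GPML Eqs. 5.10-5.12).
    K_y = K + sigma_n^2 * I must already include the noise variance."""
    K_inv = np.linalg.inv(K_y)
    alpha = K_inv @ y
    diag = np.diag(K_inv)
    mu = y - alpha / diag        # Eq. 5.12: LOO predictive means
    var = 1.0 / diag             # Eq. 5.12: LOO predictive variances
    return np.sum(-0.5 * np.log(var)
                  - (y - mu) ** 2 / (2.0 * var)
                  - 0.5 * np.log(2.0 * np.pi))
```

Maximising this over $(\sigma_n, l)$ is then an ordinary low-dimensional optimisation problem.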

There is a relationship between minimising the MSE and maximising the probability of prediction, so I think you should arrive at the same results either way. (I have not tested this; I am assuming it based on experience fitting linear models, where minimising squared error coincides with maximising a Gaussian likelihood, and I think the same logic applies here.)

What could be happening is that you are getting stuck in a local optimum. In my experience there are often three optima in squared exponential hyperparameter space: one where we overfit the data, one where we underfit it, and one where the fit is just right. This comes down to balancing the two main terms of the log marginal likelihood:

$\ln P(y \mid X, \theta) = -\frac{1}{2}\ln|K| - \frac{1}{2}y^\top K^{-1}y - \frac{N}{2}\ln(2\pi)$

The three components can be read as balancing the complexity of the GP (the $-\frac{1}{2}\ln|K|$ term, which penalises overfitting) against the data fit (the $-\frac{1}{2}y^\top K^{-1}y$ term), with a constant at the end. It sounds like your optimisation has found a state where the complexity term is outweighed by the data fit term.
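To see the balance concretely, here is a sketch that splits the log marginal likelihood into its three terms via a Cholesky factorisation (assuming, as above, that `K_y` already includes the $\sigma_n^2 I$ noise term):

```python
import numpy as np

def lml_terms(K_y, y):
    """Return the (complexity, data_fit, constant) terms of the log
    marginal likelihood; K_y = K + sigma_n^2 * I. Sketch only."""
    n = len(y)
    L = np.linalg.cholesky(K_y)                    # K_y = L L^T
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    complexity = -np.sum(np.log(np.diag(L)))       # -1/2 ln|K_y|
    data_fit = -0.5 * y @ alpha                    # -1/2 y^T K_y^{-1} y
    constant = -0.5 * n * np.log(2.0 * np.pi)
    return complexity, data_fit, constant
```

Printing the terms separately during optimisation makes it easy to see which one is dominating at a given optimum.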

My advice is to rerun your optimisation from several randomised initialisations of the hyperparameters and keep the best result. Also make sure you don't have any bugs in your code; they are sometimes hard to spot but can cause nightmares (story of my life!).
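A hedged sketch of that multi-start strategy, using SciPy's L-BFGS-B, toy data, and a stand-in negative log marginal likelihood over $(\log l, \log \sigma_n^2)$ (all names and values here are illustrative):

```python
import numpy as np
from scipy.optimize import minimize

def neg_lml(log_params, X, y):
    """Negative log marginal likelihood for a 1-D SE kernel with
    hyperparameters (log length-scale, log noise variance)."""
    log_l, log_noise = np.clip(log_params, -10.0, 10.0)  # keep exp() sane
    l, noise_var = np.exp(log_l), np.exp(log_noise)
    sq = (X[:, None] - X[None, :]) ** 2
    # Small jitter keeps the Cholesky factorisation stable.
    K_y = np.exp(-0.5 * sq / l**2) + (noise_var + 1e-6) * np.eye(len(X))
    L = np.linalg.cholesky(K_y)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return (np.sum(np.log(np.diag(L))) + 0.5 * y @ alpha
            + 0.5 * len(X) * np.log(2.0 * np.pi))

rng = np.random.default_rng(42)
X = np.linspace(0.0, 5.0, 25)
y = np.sin(X) + 0.1 * rng.standard_normal(25)

# Restart from several random initialisations and keep the best fit.
results = [minimize(neg_lml, rng.normal(0.0, 1.0, size=2),
                    args=(X, y), method="L-BFGS-B")
           for _ in range(5)]
best = min(results, key=lambda r: r.fun)
```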
