Optimization – Understanding Likelihood vs. Noise Kernel Hyperparameter in GPML Toolbox

gaussian-process, hyperparameter, optimization

I'm using the GPML toolbox by C. E. Rasmussen to solve the basic GP regression problem (presented in the book) with noisy observations. That is, I want to estimate the underlying function $f$ of a static noisy mapping

$$y = f(\mathbf{x}) + e, \qquad e \sim \mathcal{N}(0, \sigma^2)$$

from a set of training examples $\{ (\mathbf{x}_i, y_i) \}_{i=1}^{n}$. As far as I understand it, I should account for the noise in the observations by choosing the kernel as the sum

$$ k(\mathbf{x}_i, \mathbf{x}_j) = k_f(\mathbf{x}_i, \mathbf{x}_j) + \sigma^2_{e}\delta_{ij}$$

where the final term in the sum is the white-noise kernel (modeling the noise of the observations) and $\delta_{ij}$ is the Kronecker delta.
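For concreteness, here is a minimal sketch (my own, not from the toolbox) of what this summed covariance evaluates to on the training inputs, using GPML's covSEard for $k_f$; the variable names hyp_f and sigma_e are placeholders:

    % Sketch: covariance of the noisy targets, K = K_f + sigma_e^2 * I.
    % x is an n-by-D matrix of training inputs; hyp_f is the column vector
    % of log length-scales and log signal std that covSEard expects.
    n  = size(x, 1);
    Kf = feval(@covSEard, hyp_f, x);   % smooth part k_f(x_i, x_j)
    K  = Kf + sigma_e^2 * eye(n);      % white-noise term only on the diagonal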

When using the GPML toolbox, as those who are familiar with it know, you have to specify a likelihood. In my case I chose the Gaussian likelihood, which has one hyperparameter; in the code documentation this corresponds to the formal parameter $s_n$.

So, all together, when I perform the optimization I have one hyperparameter for the noise kernel ($\sigma_e$), one for the likelihood ($s_n$), and, say, $d$ hyperparameters for $k_f$.

I am confused about the meaning of the hyperparameters $\sigma_e$ and $s_n$. Which one of the hyperparameters ($\sigma_e$ or $s_n$) represents the variance of the noise in the observations?

If the Gaussian likelihood is the measurement model, then $s_n$ should describe the noise in the observations $y_i$. But then why do we add the noise kernel with its additional hyperparameter $\sigma_e$? That seems redundant at this point, since we already have $s_n$ to do the job. Perhaps they're one and the same and should be tied together during optimization. I'm confused.

GPML code for exact inference:

    [n, D] = size(x);
    K = feval(cov{:}, hyp.cov, x);        % evaluate covariance matrix
    m = feval(mean{:}, hyp.mean, x);      % evaluate mean vector
    sn2 = exp(2*hyp.lik);                 % noise variance of likGauss
    if sn2 < 1e-6          % very tiny sn2 can lead to numerical trouble
      L = chol(K + sn2*eye(n)); sl = 1;   % Cholesky factor of covariance with noise
      pL = -solve_chol(L, eye(n));        % L = -inv(K+inv(sW^2))
    else
      L = chol(K/sn2 + eye(n)); sl = sn2; % Cholesky factor of B
      pL = L;                             % L = chol(eye(n)+sW*sW'.*K)
    end
    alpha = solve_chol(L, y-m)/sl;

Here sn2 is the noise variance derived from the likelihood hyperparameter, and hyp.cov contains the kernel hyperparameters (including the noise-kernel hyperparameter $\sigma_e$).
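Note how GPML stores hyperparameters on a log scale. A small sketch of how the likelihood hyperparameter would be set and then recovered by the code above (the value 0.1 is an arbitrary example of mine):

    % GPML keeps hyperparameters as logs; for a noise std of 0.1:
    hyp.lik = log(0.1);        % likGauss hyperparameter, log(sn)
    % inside the exact inference code the noise *variance* is then recovered as
    sn2 = exp(2*hyp.lik);      % sn^2 = 0.01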

Best Answer

So I finally figured out the answer to my problem. The crux of it was a fundamental misunderstanding of how one should implement specific regression tasks in the GPML toolbox, that is, the correspondence between the task formulation and the GPML implementation.

Now, to explain this, recall the problem formulation from the GPML book (Rasmussen & Williams), where equation (2.20) gives the covariance of the noisy targets:

$$\operatorname{cov}(y_p, y_q) = k(\mathbf{x}_p, \mathbf{x}_q) + \sigma_n^2\,\delta_{pq} \tag{2.20}$$

Here the book's $\sigma_n$ is the observation-noise standard deviation, i.e. the $\sigma_e$ from my question above.

You may be tempted, understandably so, to go into GPML and implement the covariance function (2.20) like this:

    cov = {@covSum, {@covSEard, @covNoise}};
    lik = @likGauss;
    ... use minimize() and gp() ...

In this case your hyperparameters are:

    hyp.cov = [ell_1, ..., ell_D, sf, sigma_e];
    hyp.lik = [sn];
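For concreteness, a sketch (mine, with arbitrary initial values) of how this redundant hyperparameter struct would be initialized for $D$-dimensional inputs:

    % covSum concatenates the hyperparameters of its parts, in order:
    hyp.cov = [zeros(D,1); ...   % log length-scales ell_1..ell_D (covSEard)
               0;          ...   % log signal std sf (covSEard)
               log(0.1)];        % log sigma_e (covNoise)
    hyp.lik = log(0.1);          % log sn (likGauss)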

But in fact, what you should be doing is this:

    cov = @covSEard;
    lik = @likGauss;
    ... use minimize() and gp() ...

In this case your hyperparameters are:

    hyp.cov = [ell_1, ..., ell_D, sf]
    hyp.lik = [sn]    % here sn is identical to the sigma_e in (2.20)
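To make this concrete, here is a minimal end-to-end sketch of the recommended setup. The names x (n-by-D training inputs), y (targets), xs (test inputs), and D, as well as the initial hyperparameter values, are assumptions of mine, not part of the original post:

    meanfunc = @meanZero;
    covfunc  = @covSEard;
    likfunc  = @likGauss;
    hyp.mean = [];
    hyp.cov  = zeros(D+1, 1);   % [log ell_1..ell_D; log sf], all initialized to log(1)
    hyp.lik  = log(0.1);        % log sn, the observation-noise std (the sigma_e role)
    % maximize the log marginal likelihood over all hyperparameters ...
    hyp = minimize(hyp, @gp, -100, @infExact, meanfunc, covfunc, likfunc, x, y);
    % ... and predict (mean and variance) at the test inputs
    [mu, s2] = gp(hyp, @infExact, meanfunc, covfunc, likfunc, x, y, xs);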

So now you only have the likelihood parameter $s_n$ ($= \sigma_e$) for the observation noise, and the whole problem of which parameter controls what is gone. In hindsight the redundancy is clear from the inference code above: exact inference already adds $s_n^2 I$ to the kernel matrix, so a covNoise term on top of that would yield $K_f + (\sigma_e^2 + s_n^2) I$, where only the sum of the two variances is identifiable.

I came to realize this when I inspected the code for the paper "Robust Filtering and Smoothing with Gaussian Processes", available here: http://mloss.org/software/view/396/. In the paper the authors mention using the same covariance structure as in (2.20), and yet in the code you can see that:

  1. the number of hyperparameters used is one fewer than you would initially expect (i.e., $D+2$ instead of $D+3$), and
  2. the hyperparameter $\sigma_e$ was used in the inference code in the place where the likelihood parameter $s_n$ is used in the GPML toolbox.