Solved – How to correctly use the GPML Matlab code for an actual (non-demo) problem

gaussian process, machine learning, MATLAB, regression

I have downloaded the most recent GPML Matlab code, and I have read the documentation and run the regression demo without any problems. However, I am having difficulty understanding how to apply it to a regression problem that I am faced with.

The regression problem is defined as follows:


Let $\mathbf{x}_i \in \mathbb{R}^{20}$ be an input vector and $\mathbf{y}_i \in \mathbb{R}^{25}$ be its corresponding target. The $M$ inputs are arranged into a matrix $\mathbf{X} = [\mathbf{x}_1, \dots, \mathbf{x}_M]^\top$ and their corresponding targets are stored in a matrix $\mathbf{Y} = [\mathbf{y}_1 - \mathbf{\bar{y}}, \dots, \mathbf{y}_M - \mathbf{\bar{y}}]^\top$, with $\mathbf{\bar{y}}$ being the mean of the targets $\mathbf{y}_1, \dots, \mathbf{y}_M$.

I wish to train a GPR model $\mathcal{G} = \lbrace \mathbf{X}, \mathbf{Y}, \theta \rbrace$ using the squared exponential function:

$k(\mathbf{x}_i, \mathbf{x}_j) = \alpha^2 \exp\left(-\frac{1}{2\beta^2}\lVert\mathbf{x}_i - \mathbf{x}_j\rVert^2\right) + \gamma^2\delta_{ij},$

where $\delta_{ij}$ equals $1$ if $i = j$ and $0$ otherwise. The hyperparameters are $\theta = (\alpha, \beta, \gamma)$, where $\gamma$ is the assumed noise level in the training data and $\beta$ is the length-scale.

To train the model, I need to minimise the negative log marginal likelihood with respect to the hyperparameters:

$-\log p(\mathbf{Y} \mid \mathbf{X}, \theta) = \frac{1}{2} \operatorname{tr}\left(\mathbf{Y}^\top\mathbf{K}^{-1}\mathbf{Y}\right) + \frac{1}{2}\log\lvert\mathbf{K}\rvert + c,$

where $c$ is a constant and the matrix $\mathbf{K}$ is a function of the hyperparameters (see the definition of $k(\mathbf{x}_i, \mathbf{x}_j)$ above).
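As a sanity check on my understanding, I can compute this objective directly in MATLAB without GPML. This is a minimal sketch with placeholder hyperparameter values, assuming X is the M×20 input matrix and Y the centred M×25 target matrix:

alpha = 1; beta = 2; gamma = 0.1;                            % placeholder values for theta
D2 = bsxfun(@plus, sum(X.^2,2), sum(X.^2,2)') - 2*(X*X');    % ||x_i - x_j||^2 for all pairs
K  = alpha^2 * exp(-D2 / (2*beta^2)) + gamma^2 * eye(size(X,1));
L  = chol(K, 'lower');                                       % Cholesky factor, K = L*L'
A  = L \ Y;                                                  % so that A'*A = Y'*inv(K)*Y
nlml = 0.5*sum(A(:).^2) + sum(log(diag(L)));                 % the objective, up to the constant c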


Based on the demo given on the GPML website, my attempt at implementing this using the GPML Matlab code is below.

covfunc = @covSEiso;                 % isotropic squared exponential covariance
likfunc = @likGauss;                 % Gaussian likelihood (i.i.d. noise)
hyp.cov = [0; 0];                    % initial [log(length-scale); log(signal std)]
hyp.lik = log(0.1);                  % initial log noise standard deviation
% minimise the negative log marginal likelihood w.r.t. the hyperparameters
hyp = minimize(hyp, @gp, -100, @infExact, [], covfunc, likfunc, X1, Y1(:, n));
exp(hyp.lik)                         % learned noise level
nlml = gp(hyp, @infExact, [], covfunc, likfunc, X1, Y1(:, n));
[m, s2] = gp(hyp, @infExact, [], covfunc, likfunc, X1, Y1(:, n), X2);
Y2r(:, n) = m;                       % predictive mean for output dimension n

X1 contains the training inputs

X2 contains the test inputs

Y1 contains the training targets

Y2r are the estimates from applying the model

n is the index used to regress each element in the output vector
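For reference, my understanding of how GPML's hyperparameter struct maps onto $\theta = (\alpha, \beta, \gamma)$ is below (covSEiso and likGauss store the logs of their parameters; please correct me if this mapping is wrong):

alpha = 1; beta = 2; gamma = 0.1;    % placeholder values
hyp.cov = [log(beta); log(alpha)];   % covSEiso: [log(length-scale); log(signal std)]
hyp.lik = log(gamma);                % likGauss: log(noise std)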

Given the problem, is this the correct way to train and apply the GPR model? If not, what do I need to change?

Best Answer

The GP does a good job on your problem's training data. However, it's not so great on the test data. You have probably already run something like the following yourself:

load('../XYdata_01_01_ab.mat');

for N = 1 : 25
    % normalise this output dimension to zero mean, unit variance
    mu = mean(Y1(N,:));
    sd = std(Y1(N,:));
    Y1(N,:) = (Y1(N,:) - mu) / sd;
    Y2(N,:) = (Y2(N,:) - mu) / sd;

    covfunc = @covSEiso;
    ell = 2;                          % initial length-scale
    sf = 1;                           % initial signal standard deviation
    hyp.cov = [log(ell); log(sf)];

    likfunc = @likGauss;
    sn = 1;                           % initial noise standard deviation
    hyp.lik = log(sn);

    % optimise the hyperparameters, then predict on the training inputs
    hyp = minimize(hyp, @gp, -100, @infExact, [], covfunc, likfunc, X1', Y1(N,:)');
    [m, s2] = gp(hyp, @infExact, [], covfunc, likfunc, X1', Y1(N,:)', X1');
    figure;
    subplot(2,1,1); hold on;
    title(['N = ' num2str(N)]);
    f = [m+2*sqrt(s2); flipud(m-2*sqrt(s2))];   % 95% predictive band
    x = 1:length(m);
    fill([x'; flipud(x')], f, [7 7 7]/8);
    plot(Y1(N,:)', 'b');
    plot(m, 'r');
    mse_train = mean((Y1(N,:)' - m).^2);

    % predict on the test inputs
    [m, s2] = gp(hyp, @infExact, [], covfunc, likfunc, X1', Y1(N,:)', X2');
    subplot(2,1,2); hold on;
    f = [m+2*sqrt(s2); flipud(m-2*sqrt(s2))];
    x = 1:length(m);
    fill([x'; flipud(x')], f, [7 7 7]/8);
    plot(Y2(N,:)', 'b');
    plot(m, 'r');
    mse_test = mean((Y2(N,:)' - m).^2);

    fprintf('N = %d -- train = %5.2f   test = %5.2f\n', N, mse_train, mse_test);
end

By tuning the hyperparameters manually rather than with the minimize function, it is possible to balance the train and test error somewhat, but tuning the method by looking at the test error is not what you're supposed to do. I think what's happening is heavy overfitting to the three subjects that generated the training data. No method will do a good job here out of the box, and how could it? You provide the training data, so the method tries to do as well as possible on the training data without overfitting. And in fact, it doesn't overfit in the classical sense: it doesn't overfit to the data, it overfits to the three training subjects. E.g., cross-validating with the training set would tell us that there's no overfitting. Still, your test set will be explained poorly.
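To make the cross-validation point concrete, here is a rough sketch of the subject-wise check I have in mind, for one output dimension N. It assumes a hypothetical label vector subj (1×M) recording which of the three training subjects produced each column of X1, plus the covfunc/likfunc from the code above; none of this is in your data as posted:

% Random folds over pooled frames would look fine; holding out whole
% subjects exposes the subject-level overfitting.
for s = unique(subj)
    tr = (subj ~= s);                % train on two subjects, test on the third
    hyp0.cov = [log(2); log(1)]; hyp0.lik = log(1);
    hyp = minimize(hyp0, @gp, -100, @infExact, [], covfunc, likfunc, ...
                   X1(:,tr)', Y1(N,tr)');
    mcv = gp(hyp, @infExact, [], covfunc, likfunc, ...
             X1(:,tr)', Y1(N,tr)', X1(:,~tr)');
    fprintf('held-out subject %d: mse = %5.2f\n', s, mean((Y1(N,~tr)' - mcv).^2));
end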

What you can do is:

  1. Get data from more subjects for training. This way your fourth person will be less likely to look like an "outlier" as it does currently. Also, you have just one sequence of each person, right? Maybe it would help to record the sequence multiple times.

  2. Somehow incorporate prior knowledge about your task that would keep a method from overfitting to specific subjects. In a GP that could be done via the covariance function (a rough sketch follows this list), but it's probably not that easy to do ...

  3. If I'm not mistaken, the sequences are in fact time series. Maybe it would make sense to exploit the temporal relations, for instance using recurrent neural networks.

There's most definitely more, but those are the things I can think of right now.
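Regarding point 2, the cheapest version in GPML is covariance composition. Purely as a hypothetical illustration (I'm not claiming this particular structure fits your task), a smooth SE component plus a linear trend could look like:

covfunc = {@covSum, {@covSEiso, @covLINone}};   % SE part + linear part
hyp.cov = [log(2); log(1); log(1)];             % [log(ell_SE); log(sf_SE); log(ell_lin)]
hyp.lik = log(1);
hyp = minimize(hyp, @gp, -100, @infExact, [], covfunc, likfunc, X1', Y1(N,:)');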
