Solved – How to correctly use the GPML Matlab code for an actual (non-demo) problem

gaussian process, machine learning, MATLAB, regression

I have downloaded the most recent GPML Matlab code, and I have read the documentation and run the regression demo without any problems. However, I am having difficulty understanding how to apply it to a regression problem that I am faced with.

The regression problem is defined as follows:


Let $\mathbf{x}_i \in \mathbb{R}^{20}$ be an input vector and $\mathbf{y}_i \in \mathbb{R}^{25}$ be its corresponding target. The $M$ inputs are arranged into a matrix $\mathbf{X} = [\mathbf{x}_1, \dots, \mathbf{x}_M]^\top$ and their corresponding targets are stored in a matrix $\mathbf{Y} = [\mathbf{y}_1 - \mathbf{\bar{y}}, \dots, \mathbf{y}_M - \mathbf{\bar{y}}]^\top$, with $\mathbf{\bar{y}}$ being the mean of the targets $\mathbf{y}_1, \dots, \mathbf{y}_M$.

I wish to train a GPR model $\mathcal{G} = \lbrace \mathbf{X}, \mathbf{Y}, \theta \rbrace$ using the squared exponential function:

$k(\mathbf{x}_i, \mathbf{x}_j) = \alpha^2 \exp\left(-\frac{1}{2\beta^2}\lVert\mathbf{x}_i - \mathbf{x}_j\rVert^2\right) + \gamma^2\delta_{ij},$

where $\delta_{ij}$ equals $1$ if $i = j$ and $0$ otherwise. The hyperparameters are $\theta = (\alpha, \beta, \gamma)$, where $\gamma$ is the assumed noise level in the training data and $\beta$ is the length-scale.

To train the model, I need to minimise the negative log marginal likelihood with respect to the hyperparameters:

$-\log p(\mathbf{Y} \mid \mathbf{X}, \theta) = \frac{1}{2} \operatorname{tr}\left(\mathbf{Y}^\top\mathbf{K}^{-1}\mathbf{Y}\right) + \frac{1}{2}\log\lvert\mathbf{K}\rvert + c,$

where $c$ is a constant and the matrix $\mathbf{K}$ is a function of the hyperparameters (see the definition of $k(\mathbf{x}_i, \mathbf{x}_j)$ above).
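As a sanity check on my understanding, I can compute this objective directly in MATLAB without GPML. This is a minimal sketch with placeholder hyperparameter values, assuming X is the M×20 input matrix and Y the centred M×25 target matrix:

alpha = 1; beta = 2; gamma = 0.1;                            % placeholder values for theta
D2 = bsxfun(@plus, sum(X.^2,2), sum(X.^2,2)') - 2*(X*X');    % ||x_i - x_j||^2 for all pairs
K  = alpha^2 * exp(-D2 / (2*beta^2)) + gamma^2 * eye(size(X,1));
L  = chol(K, 'lower');                                       % Cholesky factor, K = L*L'
A  = L \ Y;                                                  % so that A'*A = Y'*inv(K)*Y
nlml = 0.5*sum(A(:).^2) + sum(log(diag(L)));                 % the objective, up to the constant c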


Based on the demo given on the GPML website, my attempt at implementing this using the GPML Matlab code is below.

covfunc = @covSEiso;                 % isotropic squared exponential covariance
likfunc = @likGauss;                 % Gaussian likelihood (i.i.d. noise)
hyp.cov = [0; 0];                    % initial [log(length-scale); log(signal std)]
hyp.lik = log(0.1);                  % initial log noise standard deviation
% minimise the negative log marginal likelihood w.r.t. the hyperparameters
hyp = minimize(hyp, @gp, -100, @infExact, [], covfunc, likfunc, X1, Y1(:, n));
exp(hyp.lik)                         % learned noise level
nlml = gp(hyp, @infExact, [], covfunc, likfunc, X1, Y1(:, n));
[m, s2] = gp(hyp, @infExact, [], covfunc, likfunc, X1, Y1(:, n), X2);
Y2r(:, n) = m;                       % predictive mean for output dimension n

X1 contains the training inputs

X2 contains the test inputs

Y1 contains the training targets

Y2r are the estimates from applying the model

n is the index used to regress each element in the output vector
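For reference, my understanding of how GPML's hyperparameter struct maps onto $\theta = (\alpha, \beta, \gamma)$ is below (covSEiso and likGauss store the logs of their parameters; please correct me if this mapping is wrong):

alpha = 1; beta = 2; gamma = 0.1;    % placeholder values
hyp.cov = [log(beta); log(alpha)];   % covSEiso: [log(length-scale); log(signal std)]
hyp.lik = log(gamma);                % likGauss: log(noise std)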

Given the problem, is this the correct way to train and apply the GPR model? If not, what do I need to change?

Best Answer

The GP does a good job on your problem's training data. However, it's not so great on the test data. You have probably already run something like the following yourself:

load('../XYdata_01_01_ab.mat');

for N = 1 : 25
    % normalise this output dimension to zero mean, unit variance
    mu = mean(Y1(N,:));
    sd = std(Y1(N,:));
    Y1(N,:) = (Y1(N,:) - mu) / sd;
    Y2(N,:) = (Y2(N,:) - mu) / sd;

    covfunc = @covSEiso;
    ell = 2;                          % initial length-scale
    sf = 1;                           % initial signal standard deviation
    hyp.cov = [log(ell); log(sf)];

    likfunc = @likGauss;
    sn = 1;                           % initial noise standard deviation
    hyp.lik = log(sn);

    % optimise the hyperparameters, then predict on the training inputs
    hyp = minimize(hyp, @gp, -100, @infExact, [], covfunc, likfunc, X1', Y1(N,:)');
    [m, s2] = gp(hyp, @infExact, [], covfunc, likfunc, X1', Y1(N,:)', X1');
    figure;
    subplot(2,1,1); hold on;
    title(['N = ' num2str(N)]);
    f = [m+2*sqrt(s2); flipud(m-2*sqrt(s2))];   % 95% predictive band
    x = 1:length(m);
    fill([x'; flipud(x')], f, [7 7 7]/8);
    plot(Y1(N,:)', 'b');
    plot(m, 'r');
    mse_train = mean((Y1(N,:)' - m).^2);

    % predict on the test inputs
    [m, s2] = gp(hyp, @infExact, [], covfunc, likfunc, X1', Y1(N,:)', X2');
    subplot(2,1,2); hold on;
    f = [m+2*sqrt(s2); flipud(m-2*sqrt(s2))];
    x = 1:length(m);
    fill([x'; flipud(x')], f, [7 7 7]/8);
    plot(Y2(N,:)', 'b');
    plot(m, 'r');
    mse_test = mean((Y2(N,:)' - m).^2);

    fprintf('N = %d -- train = %5.2f   test = %5.2f\n', N, mse_train, mse_test);
end

By tuning the hyperparameters manually rather than with the minimize function, it is possible to balance the train and test error somewhat, but tuning the method by looking at the test error is not what you're supposed to do. I think what's happening is heavy overfitting to the three subjects that generated the training data. No method will do a good job here out of the box, and how could it? You provide the training data, so the method tries to do as well as possible on the training data without overfitting. And in fact, it doesn't overfit in the classical sense: it doesn't overfit to the data, it overfits to the three training subjects. E.g., cross-validating with the training set would tell us that there's no overfitting. Still, your test set will be explained poorly.
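To make the cross-validation point concrete, here is a rough sketch of the subject-wise check I have in mind, for one output dimension N. It assumes a hypothetical label vector subj (1×M) recording which of the three training subjects produced each column of X1, plus the covfunc/likfunc from the code above; none of this is in your data as posted:

% Random folds over pooled frames would look fine; holding out whole
% subjects exposes the subject-level overfitting.
for s = unique(subj)
    tr = (subj ~= s);                % train on two subjects, test on the third
    hyp0.cov = [log(2); log(1)]; hyp0.lik = log(1);
    hyp = minimize(hyp0, @gp, -100, @infExact, [], covfunc, likfunc, ...
                   X1(:,tr)', Y1(N,tr)');
    mcv = gp(hyp, @infExact, [], covfunc, likfunc, ...
             X1(:,tr)', Y1(N,tr)', X1(:,~tr)');
    fprintf('held-out subject %d: mse = %5.2f\n', s, mean((Y1(N,~tr)' - mcv).^2));
end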

What you can do is:

  1. Get data from more subjects for training. This way your fourth person will be less likely to look like an "outlier" as it does currently. Also, you have just one sequence of each person, right? Maybe it would help to record the sequence multiple times.

  2. Somehow incorporate prior knowledge about your task that would keep a method from overfitting to specific subjects. In a GP that could be done via the covariance function (a rough sketch follows this list), but it's probably not that easy to do ...

  3. If I'm not mistaken, the sequences are in fact time series. Maybe it would make sense to exploit the temporal relations, for instance using recurrent neural networks.

There's most definitely more, but those are the things I can think of right now.
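Regarding point 2, the cheapest version in GPML is covariance composition. Purely as a hypothetical illustration (I'm not claiming this particular structure fits your task), a smooth SE component plus a linear trend could look like:

covfunc = {@covSum, {@covSEiso, @covLINone}};   % SE part + linear part
hyp.cov = [log(2); log(1); log(1)];             % [log(ell_SE); log(sf_SE); log(ell_lin)]
hyp.lik = log(1);
hyp = minimize(hyp, @gp, -100, @infExact, [], covfunc, likfunc, X1', Y1(N,:)');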
