Solved – Using Gaussian process regression with non-Gaussian data

bayesian, gaussian process, prediction, regression

I have a question about the practical implementation and interpretation of the Gaussian process regression model given by Rasmussen & Williams (http://www.gaussianprocess.org/gpml).

The regression problem is defined as follows:

Let $x_i \in \mathbb{R}^{5}$ be an input vector and $y_i \in \mathbb{R}$ be its corresponding target. The $M$ inputs are arranged into a matrix $X=[x_1,\dots,x_M]^\top$ and their corresponding targets are stored in a matrix $Y=[y_1,\dots,y_M]^\top$.

I wish to train a GPR model $G=\{X,Y,\theta\}$ using the squared exponential covariance function:

$k(x_i,x_j)=\alpha^2\exp\left(-\frac{1}{2\beta^2}(x_i-x_j)^2\right)+\gamma^2\delta_{ij}$,

where $\delta_{ij}$ equals 1 if $i=j$ and 0 otherwise. The hyperparameters are $\theta=(\alpha,\beta,\gamma)$, where $\gamma$ is the assumed noise level in the training data and $\beta$ is the length-scale.
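As a rough illustration of the covariance function above, the following Octave/MATLAB sketch builds the kernel matrix for a few toy inputs. It assumes that $(x_i-x_j)^2$ means the squared Euclidean distance for vector inputs, and it uses the GPML-style names `sf`, `ell` and `sn` for $\alpha$, $\beta$ and $\gamma$. The helper `se_kernel`, the toy inputs and the hyperparameter values are made up for the example and are not part of GPML.

```matlab
% Sketch: squared exponential kernel matrix plus noise on the diagonal.
% se_kernel is a hypothetical helper; it relies on implicit expansion
% (Octave, or MATLAB R2016b and later).
se_kernel = @(A, B, sf, ell) sf^2 * exp(-0.5 / ell^2 * ...
    max(sum(A.^2, 2) + sum(B.^2, 2)' - 2 * (A * B'), 0));

X = [0 0 0 0 0; 1 0 0 0 0; 0 2 0 0 0];   % M = 3 toy inputs in R^5, one per row
sf = 1.5; ell = 2.0; sn = 0.1;           % illustrative values for alpha, beta, gamma

% gamma^2 * delta_ij only contributes when both arguments are the same training point
K = se_kernel(X, X, sf, ell) + sn^2 * eye(size(X, 1));
disp(K);
```

The diagonal of `K` equals `sf^2 + sn^2`, and the off-diagonal entries decay with the distance between the inputs.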

To train the model, I need to minimise the negative log marginal likelihood with respect to the hyperparameters:

$-\log p(Y \mid X,\theta)=\frac{1}{2}\operatorname{tr}(Y^\top K^{-1}Y)+\frac{1}{2}\log|K|+c$,
where $c$ is a constant and the matrix $K$ is a function of the hyperparameters.
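For a single target column, the trace term reduces to $y^\top K^{-1} y$ and the constant is $c=\frac{M}{2}\log 2\pi$. The following Octave/MATLAB sketch evaluates this objective with a Cholesky factorisation, which is one common way to do it. The 1-D toy data and the hyperparameter values are assumptions made for the example.

```matlab
% Sketch: negative log marginal likelihood for one target column y.
% Toy data and hyperparameters are illustrative; 1-D inputs for brevity.
x = [-2; -1; 0; 1; 2];
y = [0.1; 0.8; 1.0; 0.7; -0.2];
sf = 1.0; ell = 1.0; sn = 0.1;                     % alpha, beta, gamma

D2 = (x - x').^2;                                  % squared distances (implicit expansion)
K  = sf^2 * exp(-0.5 * D2 / ell^2) + sn^2 * eye(numel(x));

L = chol(K, 'lower');                              % K = L * L'
a = L' \ (L \ y);                                  % a = K^{-1} * y
nlml = 0.5 * (y' * a) ...                          % data-fit term
     + sum(log(diag(L))) ...                       % 0.5 * log|K|
     + 0.5 * numel(x) * log(2 * pi);               % constant c
fprintf('nlml = %.4f\n', nlml);
```

The Cholesky factor avoids forming $K^{-1}$ explicitly and gives $\frac{1}{2}\log|K|$ as the sum of the logs of its diagonal.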

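covfunc = @covSEiso;              % squared exponential covariance, single length-scale
likfunc = @likGauss;              % Gaussian observation noise
sn = 0.1;                         % initial noise standard deviation
hyp2.cov = [0; 0];                % [log(ell); log(sf)], i.e. ell = 1 and sf = 1
hyp2.lik = log(sn);               % log of the noise standard deviation
% minimise the negative log marginal likelihood (at most 100 function evaluations)
hyp2 = minimize(hyp2, @gp, -100, @infExact, [], covfunc, likfunc, X1, Y1(:,n));
exp(hyp2.lik)                     % learned noise standard deviation
nlml2 = gp(hyp2, @infExact, [], covfunc, likfunc, X1, Y1(:, n));
[m s2] = gp(hyp2, @infExact, [], covfunc, likfunc, X1, Y1(:, n), XT);  % predictive mean and variance
YT(:, n) = m;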

X1 and Y1 are the training inputs and targets.

XT are the test inputs and YT are the predicted targets.

My question is about the use of [m,s2] as predictions for Y (the second-to-last line of code). If the original samples Y are not normally distributed, is the predictive distribution given by the values in [m,s2] still valid? For example, is it valid to say that for a test value $x_{*i}$ the distribution of the prediction for the target $y_{*i}$ is $N(m,s2)$, or am I misinterpreting the implementation of this model?

Best Answer

For a Gaussian process, the prediction at a new input $\hat{x}$ is the output $\hat{y}$ given by

$$\hat{y} = K^{y}(\hat{x}, x)\,\left(K^{Y}\right)^{-1} y$$

where $K^{y}(\hat{x}, x)$ is the covariance between the new input and the training inputs, and $K^{Y}$ is the covariance matrix of the training data. In your code this is $m$: the mean is the most probable value under the predictive distribution. For each point you want to evaluate, you compute $K^{y}(\hat{x}, x)$ and then apply the equation for the mean to obtain your output value.
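To make the mean equation concrete, here is a minimal Octave/MATLAB sketch that computes the predictive mean and the predictive variance by hand, assuming a zero prior mean and a squared exponential kernel with Gaussian noise. The toy data, the hyperparameter values and the names `k`, `Ks`, `lo` and `hi` are assumptions made for the example; this is not the GPML implementation.

```matlab
% Sketch: GP predictive mean and variance at test inputs xs (zero prior mean).
x  = [-2; -1; 0; 1; 2];   y = [0.1; 0.8; 1.0; 0.7; -0.2];   % toy training data
xs = [-0.5; 0.5; 3];                                        % toy test inputs
sf = 1.0; ell = 1.0; sn = 0.1;                              % illustrative hyperparameters

k  = @(a, b) sf^2 * exp(-0.5 * (a - b').^2 / ell^2);   % noise-free SE kernel
K  = k(x, x) + sn^2 * eye(numel(x));                   % training covariance with noise
Ks = k(xs, x);                                         % test-vs-training covariance

L  = chol(K, 'lower');
m  = Ks * (L' \ (L \ y));                 % mean: Ks * K^{-1} * y
V  = L \ Ks';
s2 = sf^2 + sn^2 - sum(V.^2, 1)';         % variance of a noisy target at each xs

lo = m - 1.96 * sqrt(s2);                 % approximate 95% band; only meaningful
hi = m + 1.96 * sqrt(s2);                 % if the Gaussian assumptions hold
disp([xs m s2 lo hi]);
```

Here `m` and `s2` play the same role as the `[m s2]` outputs in the question. The band `[lo, hi]` is a statement about the model's Gaussian predictive distribution, so it is only as trustworthy as the model assumptions.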

You can think of a Gaussian process as a collection of Gaussian distributions, one for each point along the input path.

Related Solutions

Solved – How to correctly use the GPML Matlab code for an actual (non-demo) problem

The GP does a good job on your problem's training data. However, it's not so great on the test data. You've probably already run something like the following yourself:

load('../XYdata_01_01_ab.mat');

for N = 1 : 25
    % normalize train and test targets with the training mean and std
    mu = mean(Y1(N,:));
    sigma = std(Y1(N,:));
    Y1(N,:) = (Y1(N,:) - mu) / sigma;
    Y2(N,:) = (Y2(N,:) - mu) / sigma;

    % squared exponential covariance: initial length-scale and signal std
    covfunc = @covSEiso;
    ell = 2;
    sf = 1;
    hyp.cov = [log(ell); log(sf)];

    % Gaussian likelihood: initial noise std
    likfunc = @likGauss;
    sn = 1;
    hyp.lik = log(sn);

    % fit the hyperparameters by minimizing the negative log marginal likelihood
    hyp = minimize(hyp, @gp, -100, @infExact, [], covfunc, likfunc, X1', Y1(N,:)');

    % predictions on the training inputs
    [m, s2] = gp(hyp, @infExact, [], covfunc, likfunc, X1', Y1(N,:)', X1');
    figure;
    subplot(2,1,1); hold on;
    title(['N = ' num2str(N)]);
    f = [m+2*sqrt(s2); flipud(m-2*sqrt(s2))];   % +/- 2 std band
    x = (1:length(m))';
    fill([x; flipud(x)], f, [7 7 7]/8);
    plot(Y1(N,:)', 'b');
    plot(m, 'r');
    mse_train = mean((Y1(N,:)' - m).^2);

    % predictions on the test inputs
    [m, s2] = gp(hyp, @infExact, [], covfunc, likfunc, X1', Y1(N,:)', X2');
    subplot(2,1,2); hold on;
    f = [m+2*sqrt(s2); flipud(m-2*sqrt(s2))];
    x = (1:length(m))';
    fill([x; flipud(x)], f, [7 7 7]/8);
    plot(Y2(N,:)', 'b');
    plot(m, 'r');
    mse_test = mean((Y2(N,:)' - m).^2);

    fprintf('N = %d -- train = %5.2f   test = %5.2f\n', N, mse_train, mse_test);
end

By tuning the hyperparameters manually instead of using the minimize function, it is possible to balance the train and test error somewhat, but tuning the method by looking at the test error is not what you're supposed to do. I think what's happening is heavy overfitting to the three subjects that generated the training data. No method will do a good job here out of the box, and how could it? You provide the training data, so the method tries to do as well as possible on the training data without overfitting. And in fact, it doesn't overfit in the classical sense: it doesn't overfit to the data, but it overfits to the three training subjects. For example, cross-validating within the training set would tell us that there's no overfitting. Still, your test set will be explained poorly.

What you can do is:

Get data from more subjects for training. This way your fourth person will be less likely to look like an "outlier" as it does currently. Also, you have just one sequence of each person, right? Maybe it would help to record the sequence multiple times.
Somehow incorporate prior knowledge about your task that would keep a method from overfitting to specific subjects. In a GP that could be done via the covariance function, but it's probably not that easy to do ...
If I'm not mistaken, the sequences are in fact time-series. Maybe it would make sense to exploit the temporal relations, for instance using recurrent neural networks.

There's most definitely more, but those are the things I can think of right now.

Solved – Implementation of Gaussian process

I suspect this is because you are using a Gaussian process with a zero mean function, so that unless the covariance function is non-local, the output will go to zero away from the datapoints. If you are using a local covariance function, such as the squared exponential (RBF), it is a prior over functions that says that the function should be smooth, i.e. the value of the function should be similar to its value in nearby locations. If there are no nearby samples, then the prior says very little about the value of the function, as there is no reference. Thus you get a smooth function (you can get no smoother than a straight line) and you just get the mean.

If you want to extrapolate from your model, you need to have a covariance function that tells you what to expect at least over the distance you are extrapolating. A polynomial kernel, or an RBF kernel with a broader length-scale, may help.
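The effect can be reproduced with a small hand-rolled example. The following Octave/MATLAB sketch computes the predictive mean of a zero-mean GP with an SE kernel at test inputs that lie progressively further from the data, once with a short and once with a broad length-scale. The toy data, the two length-scales and the noise level are assumptions chosen for illustration.

```matlab
% Sketch: a zero-mean GP with an SE kernel reverts to 0 far from the data.
x  = [0; 1; 2; 3];   y = [2.0; 2.5; 3.1; 3.4];   % toy training data
xs = [3.5; 6; 20];                               % test inputs, increasingly far away
sn = 0.1;                                        % noise standard deviation
ells = [1 5];                                    % short vs. broad length-scale
M = zeros(numel(xs), numel(ells));               % predictive means, one column per ell

for j = 1:numel(ells)
    k = @(a, b) exp(-0.5 * (a - b').^2 / ells(j)^2);   % SE kernel, unit signal variance
    K = k(x, x) + sn^2 * eye(numel(x));
    M(:, j) = k(xs, x) * (K \ y);                      % predictive mean at xs
end
disp([xs M]);   % columns: test input, mean for ell = 1, mean for ell = 5
```

With the short length-scale the mean at the most distant input is numerically zero, whereas the broader length-scale keeps the prediction tied to the data over a longer distance before it also decays towards the prior mean.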
