The GP does a good job on your training data, but not such a great one on the test data. You've probably already run something like the following yourself:
load('../XYdata_01_01_ab.mat');

for N = 1 : 25
    % normalize using training statistics
    m = mean(Y1(N,:));
    s = std(Y1(N,:));
    Y1(N,:) = 1/s * (Y1(N,:) - m);
    Y2(N,:) = 1/s * (Y2(N,:) - m);

    % GP setup: squared-exponential covariance, Gaussian likelihood
    covfunc = @covSEiso;
    ell = 2;
    sf = 1;
    hyp.cov = [log(ell); log(sf)];
    likfunc = @likGauss;
    sn = 1;
    hyp.lik = log(sn);

    % optimize hyperparameters on the training sequence
    hyp = minimize(hyp, @gp, -100, @infExact, [], covfunc, likfunc, X1', Y1(N,:)');

    % predictions on the training inputs
    [m, s2] = gp(hyp, @infExact, [], covfunc, likfunc, X1', Y1(N,:)', X1');
    figure;
    subplot(2,1,1); hold on;
    title(['N = ' num2str(N)]);
    f = [m + 2*sqrt(s2); flipdim(m - 2*sqrt(s2), 1)];
    x = 1:length(m);
    fill([x'; flipdim(x',1)], f, [7 7 7]/8);
    plot(Y1(N,:)', 'b');
    plot(m, 'r');
    mse_train = mse(Y1(N,:)' - m);

    % predictions on the test inputs
    [m, s2] = gp(hyp, @infExact, [], covfunc, likfunc, X1', Y1(N,:)', X2');
    subplot(2,1,2); hold on;
    f = [m + 2*sqrt(s2); flipdim(m - 2*sqrt(s2), 1)];
    x = 1:length(m);
    fill([x'; flipdim(x',1)], f, [7 7 7]/8);
    plot(Y2(N,:)', 'b');
    plot(m, 'r');
    mse_test = mse(Y2(N,:)' - m);

    fprintf('N = %d -- train = %5.2f  test = %5.2f\n', N, mse_train, mse_test);
end
By tuning the hyperparameters manually instead of using the minimize function, it is possible to balance the training and test error somewhat, but tuning a method by looking at the test error is not what you're supposed to do. I think what's happening is heavy overfitting to the three subjects that generated your training data. No method will do a good job here out of the box, and how could it? You provide the training data, and the method tries to fit it as well as possible without overfitting. And in fact, it doesn't overfit in the classical sense: it doesn't overfit to the data, but to the three training subjects. For example, cross-validating within the training set would tell us that there's no overfitting. Still, your test set will be explained poorly.
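One way to make this kind of subject-level overfitting visible is to cross-validate at the subject level rather than at the sample level, holding out one whole subject per fold. Here is a minimal sketch in Python with scikit-learn's GroupKFold (your pipeline is MATLAB/GPML; the data here, including the per-subject offsets, are an invented synthetic stand-in purely for illustration):

```python
import numpy as np
from sklearn.model_selection import GroupKFold
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

# Synthetic stand-in: 3 "subjects", 50 samples each, 1-D input
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(150, 1))
groups = np.repeat([0, 1, 2], 50)
# each subject gets its own offset -> subject-specific structure
y = np.sin(X.ravel()) + 0.5 * groups + 0.1 * rng.standard_normal(150)

gp = GaussianProcessRegressor(kernel=RBF(length_scale=2.0), alpha=1.0)
cv = GroupKFold(n_splits=3)
for train_idx, test_idx in cv.split(X, y, groups):
    gp.fit(X[train_idx], y[train_idx])
    mse = np.mean((gp.predict(X[test_idx]) - y[test_idx]) ** 2)
    print(f"held-out subject {groups[test_idx][0]}: MSE = {mse:.3f}")
```

Each fold trains on two subjects and evaluates on the third, so the held-out error reflects generalization to a new subject, which is the quantity your fourth person actually probes.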
What you can do:

- Get data from more subjects for training. This way your fourth person will be less likely to look like an "outlier", as it does currently. Also, you have just one sequence from each person, right? Maybe it would help to record each sequence multiple times.
- Somehow incorporate prior knowledge about your task that keeps the method from overfitting to specific subjects. In a GP that could be done via the covariance function, but it's probably not that easy to do ...
- If I'm not mistaken, the sequences are in fact time series. Maybe it would make sense to exploit the temporal relations, for instance with recurrent neural networks.

There's most definitely more, but those are the things I can think of right now.
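A cheap way to start exploiting the temporal relations, short of a full recurrent model, is to recast each sequence as a supervised problem over sliding windows, so that any regressor (a GP included) sees the recent past as input. A sketch in Python with numpy (the window width of 5 and the sine sequence are arbitrary illustrations):

```python
import numpy as np

def sliding_windows(seq, width):
    """Stack overlapping windows of a 1-D sequence as rows;
    seq[i:i+width] is the input used to predict seq[i+width]."""
    X = np.stack([seq[i:i + width] for i in range(len(seq) - width)])
    y = seq[width:]
    return X, y

seq = np.sin(np.linspace(0, 6, 50))
X, y = sliding_windows(seq, 5)
print(X.shape, y.shape)  # (45, 5) (45,)
```

The resulting (X, y) pairs can be fed to any standard regressor; a recurrent network would instead consume the sequences directly and learn the temporal dependencies itself.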
Best Answer
This seems to be a modelling problem rather than an implementation problem. Let us plot the predictions (ymu) from your model, together with the training data, over a denser set of points:
The axes of the image represent the inputs Xtest1 and Xtest2, and the color is the predicted mean. We see that the model output deviates from 0 only near the training points. This is due to the short length-scale hyperparameter (ell), which causes the model to learn only short-range phenomena. Let us try modifying that (all other code kept unchanged):
Now the predicted output is much closer to your ground truth. Indeed, the prediction at $(1,2)$ is now $2.9$. It seems that (with the training data kept fixed), the longer the length scale, the closer the prediction at your test point gets to $3$.
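This length-scale effect can be reproduced in a few lines. A hedged sketch in Python with scikit-learn rather than GPML (the training targets fixed at 3 along the $x_1=x_2$ line are an invented stand-in for your data; the two length-scale values are arbitrary):

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

# Training inputs on the x1 = x2 line, constant target of 3
t = np.linspace(0, 3, 4)
X_train = np.column_stack([t, t])
y_train = np.full(4, 3.0)

for ell in (0.1, 2.0):
    gp = GaussianProcessRegressor(kernel=RBF(length_scale=ell),
                                  alpha=1e-6, optimizer=None)  # keep ell fixed
    gp.fit(X_train, y_train)
    pred = gp.predict(np.array([[1.0, 2.0]]))[0]
    print(f"ell = {ell}: prediction at (1, 2) = {pred:.2f}")
```

With the short length scale the prediction at $(1,2)$ reverts to the prior mean of 0, while the long length scale carries the training signal across to the off-line test point, mirroring what you observed.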
Another observation is that all your training data lie on the $x_1=x_2$ line, so it is rather optimistic to hope that any model would learn what happens outside that line without making strong assumptions. Besides changing the hyperparameters, you could also try adding training data at various other points, e.g., on a 2-dimensional grid.
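Building such a grid of training inputs is a one-liner with numpy; a small sketch (the ranges and the 5-by-5 resolution are invented for illustration):

```python
import numpy as np

# 5 x 5 grid of candidate training inputs covering the plane,
# instead of sampling only along the x1 = x2 line
g = np.linspace(0.0, 3.0, 5)
X1g, X2g = np.meshgrid(g, g)
X_grid = np.column_stack([X1g.ravel(), X2g.ravel()])
print(X_grid.shape)  # (25, 2)
```

Evaluating the ground-truth function on X_grid and retraining would give the GP direct evidence about behaviour off the diagonal.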