The GP does a good job on your problem's training data. However, it's not so great on the test data. You've probably already run something like the following yourself:
load('../XYdata_01_01_ab.mat');

for N = 1 : 25
    % Normalize this output dimension using the training statistics,
    % and apply the same transform to the test data.
    m = mean(Y1(N,:));
    s = std(Y1(N,:));
    Y1(N,:) = (Y1(N,:) - m) / s;
    Y2(N,:) = (Y2(N,:) - m) / s;

    % Squared-exponential covariance with initial length scale and signal std. dev.
    covfunc = @covSEiso;
    ell = 2;
    sf = 1;
    hyp.cov = [log(ell); log(sf)];

    % Gaussian likelihood with initial noise std. dev.
    likfunc = @likGauss;
    sn = 1;
    hyp.lik = log(sn);

    % Optimize the hyperparameters on the training data.
    hyp = minimize(hyp, @gp, -100, @infExact, [], covfunc, likfunc, X1', Y1(N,:)');

    % Predict on the training inputs and plot mean +/- 2 std. dev.
    [m, s2] = gp(hyp, @infExact, [], covfunc, likfunc, X1', Y1(N,:)', X1');
    figure;
    subplot(2,1,1); hold on;
    title(['N = ' num2str(N)]);
    f = [m + 2*sqrt(s2); flipdim(m - 2*sqrt(s2), 1)];
    x = (1:length(m))';
    fill([x; flipdim(x,1)], f, [7 7 7]/8);
    plot(Y1(N,:)', 'b');
    plot(m, 'r');
    mse_train = mean((Y1(N,:)' - m).^2);

    % Predict on the test inputs and plot mean +/- 2 std. dev.
    [m, s2] = gp(hyp, @infExact, [], covfunc, likfunc, X1', Y1(N,:)', X2');
    subplot(2,1,2); hold on;
    f = [m + 2*sqrt(s2); flipdim(m - 2*sqrt(s2), 1)];
    x = (1:length(m))';
    fill([x; flipdim(x,1)], f, [7 7 7]/8);
    plot(Y2(N,:)', 'b');
    plot(m, 'r');
    mse_test = mean((Y2(N,:)' - m).^2);

    fprintf('N = %d -- train = %5.2f test = %5.2f\n', N, mse_train, mse_test);
end
By tuning the hyperparameters manually instead of using the minimize function, it is possible to balance the training and test error somewhat, but tuning the method by looking at the test error is not something you're supposed to do. I think what's happening is heavy overfitting to the three subjects that generated the training data. No method will do a good job here out of the box, and how could it? You provide the training data, and the method tries to fit it as well as possible without overfitting. And in fact, it doesn't overfit in the classical sense: it doesn't overfit to the data, it overfits to the three training subjects. For example, cross-validating within the training set would tell us that there's no overfitting, because every fold still contains the same three subjects. Still, your test set will be explained poorly.
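One way to see this kind of subject-level overfitting directly is to cross-validate across subjects rather than across samples. Below is a minimal sketch of leave-one-subject-out validation; the cell arrays Xs and Ys (one cell per subject, holding that subject's inputs and targets for a single output dimension) are hypothetical names for illustration, since I don't know how your data is actually organized:

nSubj = numel(Xs);
for i = 1 : nSubj
    % Hold out subject i, train on everyone else.
    trainIdx = setdiff(1:nSubj, i);
    Xtr = vertcat(Xs{trainIdx});  Ytr = vertcat(Ys{trainIdx});
    Xte = Xs{i};                  Yte = Ys{i};

    hyp.cov = [log(2); log(1)];  hyp.lik = log(1);
    hyp = minimize(hyp, @gp, -100, @infExact, [], @covSEiso, @likGauss, Xtr, Ytr);
    mu = gp(hyp, @infExact, [], @covSEiso, @likGauss, Xtr, Ytr, Xte);

    fprintf('held-out subject %d: mse = %5.2f\n', i, mean((Yte - mu).^2));
end

If these held-out-subject errors look like your test error rather than your training error, that confirms the model generalizes across samples but not across subjects.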
What you can do is:
- Get data from more subjects for training. That way your fourth person will be less likely to look like an "outlier", as it does currently. Also, you have just one sequence from each person, right? Maybe it would help to record the sequence multiple times.
- Somehow incorporate prior knowledge about your task that would keep the method from overfitting to specific subjects. In a GP that could be done via the covariance function (see the sketch after this list), but it's probably not that easy to do ...
- If I'm not mistaken, the sequences are in fact time series. Maybe it would make sense to exploit the temporal relations, for instance using recurrent neural networks.
There's most definitely more, but those are the things I can think of right now.
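Regarding the covariance idea above: in GPML you can at least compose covariance functions, e.g. add a periodic component to the squared exponential if the motion is roughly repetitive. This is only a sketch of the mechanics, not a recommendation; the initial values (the period p in particular) are made up and would have to come from your task:

% Composite covariance: smooth trend plus periodic component.
covfunc = {@covSum, {@covSEiso, @covPeriodic}};
ell1 = 2;  sf1 = 1;            % covSEiso: length scale, signal std. dev.
ell2 = 1;  p = 10;  sf2 = 1;   % covPeriodic: length scale, period, signal std. dev.
hyp.cov = [log(ell1); log(sf1); log(ell2); log(p); log(sf2)];
hyp.lik = log(1);
hyp = minimize(hyp, @gp, -100, @infExact, [], covfunc, @likGauss, X1', Y1(N,:)');

The hyperparameter vector simply stacks the parameters of the two components in order.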
I suspect this is because you are using a Gaussian process with a zero mean function, so that unless the covariance function is non-local, the output will go to zero away from the data points. If you are using a local covariance function, such as the squared exponential (RBF), it is a prior over functions that says the function should be smooth, i.e. the value of the function should be similar to its value at nearby locations. If there are no nearby samples, then the prior says very little about the value of the function, as there is no reference. Thus you get a smooth function (you can get no smoother than a straight line) and you just get the mean.
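To make that concrete: with a zero mean function, the predictive mean at a test input $x_*$ is $\bar{f}_* = k(x_*, X)\,[K(X,X) + \sigma_n^2 I]^{-1} y$, where $X$ are the training inputs, $y$ the targets and $\sigma_n^2$ the noise variance. For a local kernel like the squared exponential, $k(x_*, X) \to 0$ as $x_*$ moves away from all training inputs, so the predictive mean falls back to the prior mean, i.e. zero.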
If you want to extrapolate with your model, you need a covariance function that tells you what to expect at least over the distance you are extrapolating. A polynomial kernel, or an RBF kernel with a broader length scale, may help.
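Here is a small self-contained GPML illustration of the effect on toy data (the data is made up purely to show the behaviour): the same observations are fit with a short and a broad length scale, and predictions are requested well beyond the data range:

% Toy data: a slow trend observed only on [0, 5].
x = linspace(0, 5, 20)';
y = 0.5 * x + 0.1 * randn(20, 1);
xs = linspace(0, 10, 100)';    % test inputs extend beyond the data

hyp.lik = log(0.1);
for ell = [0.5 5]              % short vs. broad length scale
    hyp.cov = [log(ell); log(1)];
    mu = gp(hyp, @infExact, [], @covSEiso, @likGauss, x, y, xs);
    figure; hold on;
    plot(x, y, 'b+');
    plot(xs, mu, 'r');
    title(sprintf('ell = %.1f', ell));
end

With ell = 0.5 the predictive mean decays to zero almost immediately past x = 5; with ell = 5 it decays much more slowly. Note that even the broad RBF eventually reverts to the prior mean; a genuinely non-local covariance (e.g. covLINone or covPoly in GPML) is what actually continues a trend.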
Best Answer
Most research in kernel methods focuses on Mercer kernels, which have two properties: (1) the function is symmetric: $K(x,y)=K(y,x)$ and (2) the function is positive semi-definite (p.s.d.). The Gaussian covariance function is certainly p.s.d., but I can't recall if it is also p.d. -- perhaps you've mistakenly omitted the "semi-" from your mental definition?
Alternatively, this may be a numerical issue. I'm not sure how you've concluded that the result is not p.s.d., but, for example, spectral decomposition algorithms sometimes commit errors due to finite-precision arithmetic, so some of the smallest eigenvalues may come out slightly negative (on the order of whatever the error is in your numerical software). These minute errors can be safely ignored in most practical applications. Adding a small positive number to the diagonal, just large enough to swamp the numerical error, suppresses this.
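A quick way to check for (and patch) this numerically, sketched in MATLAB with K standing in for your computed covariance (Gram) matrix:

% Symmetrize first, since round-off can also break exact symmetry.
K = (K + K') / 2;
lambda = eig(K);
fprintf('smallest eigenvalue: %g\n', min(lambda));

% If the smallest eigenvalue is only slightly negative, add jitter.
if min(lambda) < 0
    K = K + (abs(min(lambda)) + 1e-10) * eye(size(K, 1));
end

If the smallest eigenvalue is large and negative, though, the problem is not numerical and the function really isn't a valid covariance.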
Testing whether an arbitrary covariance function is a valid kernel relies on Mercer's theorem, which is a bit involved for this forum; I'd recommend referring to the literature on Mercer kernels.