Solved – Neural network works well on datasets near the training set, but poorly on farther datasets. Why?

cross-validation, deep-learning, neural-networks

I've been using a Siamese neural network for the binary classification of biological data.
Each entry of the datasets I'm using has a position coordinate.

My problem is that, even though my neural network makes excellent predictions on datasets that are spatially near the training set, it is not able to do the same on farther datasets.

I'm using a held-out validation approach (no k-fold cross-validation): the algorithm reads an input dataset and splits it into a training set containing 80% of the input elements and a validation set containing the remaining 20%.

The algorithm trains the neural network on the training set and then applies the trained model to the held-out validation set. In this way it achieves excellent prediction scores on the validation set (e.g. Matthews correlation coefficient >= 0.9).
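The 80/20 held-out split described above can be sketched as follows; the feature array `X` and label array `y` are hypothetical stand-ins for the biological data, and the split uses scikit-learn's `train_test_split`:

```python
# Minimal sketch of the 80/20 held-out split, assuming the data live in
# NumPy arrays X (features) and y (binary labels). The arrays here are
# randomly generated placeholders, not the actual biological data.
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))      # hypothetical feature matrix
y = rng.integers(0, 2, size=100)   # hypothetical binary labels

# 80% of the rows go to training, the remaining 20% are held out.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=0
)
print(X_train.shape, X_val.shape)  # (80, 4) (20, 4)
```

Note that `train_test_split` shuffles the rows before splitting, which is exactly the point the accepted answer raises about how the held-out set should be drawn.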

The problem comes up, however, when I apply my trained Siamese neural network to test sets that are NOT spatially adjacent to the training set. In these cases, my prediction scores become very poor (MCC ~= +0.1).

I also attach this simple image to better explain my problem:
[Image: difference between testing on the validation set and testing on a distant test set]

Can someone help me with this?
What should I do to solve this problem?
Thanks

Best Answer

You must have some autocorrelation in your data. In most cases, if one ignores the correlation structure in the data (as in pseudolikelihood methods), the effect is that the estimated error is too small. Consider the weather on two consecutive days: it is far more likely to be similar than the weather on two randomly selected days of the year.
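The weather analogy can be illustrated numerically. A random walk is a simple stand-in for an autocorrelated series (the actual data here are invented for the illustration): neighbouring values differ far less than randomly paired values.

```python
# Illustration of the weather analogy: in an autocorrelated series,
# consecutive values are much more similar than randomly chosen pairs.
import numpy as np

rng = np.random.default_rng(42)
series = np.cumsum(rng.normal(size=1000))  # random walk ~ autocorrelated "weather"

# Mean gap between consecutive days...
consecutive_gap = np.mean(np.abs(np.diff(series)))

# ...versus mean gap between randomly paired days.
i, j = rng.integers(0, 1000, size=(2, 1000))
random_gap = np.mean(np.abs(series[i] - series[j]))

print(consecutive_gap, random_gap)  # consecutive gap is far smaller
```

This is why validating on rows adjacent to the training rows gives an optimistic error estimate: the validation points are near-duplicates of training points.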

Basically, you have done the training/test selection incorrectly. You must select at random from the entire sample, not from contiguous rows. This is why simple random sampling is unbiased while convenience sampling is not. Sampling contiguous rows of data that are ordered in some sense is effectively convenience sampling.

In the graphic you have shown, the colours of the training/validation/test sets should be scrambled together rather than forming contiguous blocks.
