Solved – Cross-validation of multiple subjects with multiple instances

Tags: classification, cross-validation, machine learning, supervised learning, train

I have a training set of 50 subjects with about 550-600 measurements each. One measurement consists of 24 features and one class label (1 or 0). So my data looks like this (simplified):

Subject  F1  F2  F3  ...  Class
1        1   3   2   ...  1
1        1   4   7   ...  0
...
2        2   3   2   ...  1
2        1   1   1   ...  1
...

I want to train my classification model (an artificial neural network) on all measurements from all 50 subjects. Can I simply use leave-one-out cross-validation at the subject level (train on all measurements of all but one subject, then test on the measurements of the held-out subject), or do I also need validation at the measurement level (k-fold cross-validation within the measurements of each single subject)?

Best Answer

To answer this question, you should ask what you want your classifier to be able to do.

If I understand correctly, training your classifier on the 'measurement level' would 'teach' it to classify sets of features from one particular subject. That is different from training it to classify any set of features, regardless of which subject it came from.

Assuming you want your classifier to be able to classify any set of features, independent of what subject those features came from, you should not do any cross-validation on the 'measurement level'.

In this same setting, I do not quite understand why you would treat all the measurements of a single subject as 'one' (in the context of leave-one-out cross-validation). Did you consider treating a single set of features (a single measurement), regardless of which subject it came from, as the 'one' thing that is left out?

[EDIT]: I just discussed this problem with a colleague. Doing cross-validation on the entire dataset (independent of subject) will leak information.

If some of a subject's measurements are in the training set and the rest of that same subject's measurements are in the testing set, the classifier has already seen that subject during training, so the estimated performance will be too optimistic. If, however, you will never get measurements from new subjects, this is not a problem.
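
To make the leak concrete, here is a minimal sketch in plain Python. The toy data (5 subjects with 4 measurements each) and the helper names `kfold_indices` and `shared_subjects` are made up for illustration and are not from the question. It splits at the measurement level and lists which subjects end up on both sides of each fold.

```python
def kfold_indices(n_samples, k):
    """Yield (train_idx, test_idx) for a plain k-fold split over measurements.

    Measurements are assigned to folds round-robin, ignoring subjects.
    """
    for fold in range(k):
        test_idx = [i for i in range(n_samples) if i % k == fold]
        train_idx = [i for i in range(n_samples) if i % k != fold]
        yield train_idx, test_idx


def shared_subjects(subjects, train_idx, test_idx):
    """Return the subjects that have measurements in both train and test."""
    in_train = {subjects[i] for i in train_idx}
    in_test = {subjects[i] for i in test_idx}
    return in_train & in_test


# Toy data (assumption): subject ID per measurement, 5 subjects x 4 measurements.
subjects = [s for s in range(1, 6) for _ in range(4)]

for train_idx, test_idx in kfold_indices(len(subjects), k=5):
    overlap = shared_subjects(subjects, train_idx, test_idx)
    print("subjects in both train and test:", sorted(overlap))
```

In this toy setup every fold has a non-empty overlap, which is exactly the situation described above: the test score then partly measures how well the model recognizes subjects it has already seen.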

If you do expect measurements from new subjects (more likely, I think), then whenever a subject is selected for the testing set, all of that subject's measurements should go into the testing set.
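
A subject-level split along these lines could look like the sketch below. It is plain Python with a hypothetical helper name (`leave_one_subject_out`) and toy subject IDs, so treat it as an illustration rather than a definitive implementation. If you use scikit-learn, `LeaveOneGroupOut` in `sklearn.model_selection` follows the same idea, with the subject IDs passed as `groups`.

```python
def leave_one_subject_out(subjects):
    """Yield (held_out, train_idx, test_idx), holding out one subject per fold.

    `subjects` holds the subject ID of each measurement, in row order.
    All measurements of the held-out subject go to the test set.
    """
    for held_out in sorted(set(subjects)):
        test_idx = [i for i, s in enumerate(subjects) if s == held_out]
        train_idx = [i for i, s in enumerate(subjects) if s != held_out]
        yield held_out, train_idx, test_idx


# Toy data (assumption): 3 subjects with an unequal number of measurements.
subjects = [1, 1, 1, 2, 2, 3, 3, 3]

folds = list(leave_one_subject_out(subjects))
for held_out, train_idx, test_idx in folds:
    print(f"test subject {held_out}: train rows {train_idx}, test rows {test_idx}")
```

With 50 subjects this gives 50 folds. Each fold trains on the measurements of 49 subjects and tests on every measurement of the remaining one, so no subject appears on both sides.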
