Solved – Do examples in the training and test sets have to be independent?

I am working on a machine learning problem where there are several data points collected per user. Some of the points are good and some are bad.

I want to get a good assessment of the machine learning model I build.

Should I choose only one point from each user and use it for either training or testing? Or can I train and test on points from the same user?

I read this: https://en.wikipedia.org/wiki/Test_set . It says to keep test set instances independent of the training set, which makes sense as a way to get a better evaluation of the model.

My main question is whether training on several points from the same user is okay, i.e., whether the training points themselves need to be independent of each other.

Thanks.

Best Answer

Training on multiple records/observations from the same user/subject is ok, but you want your test data independent of your training data.

For example, you might imagine two approaches to constructing a test set (e.g., for cross-validation):

Record-wise: select records at random and assign them to the test set.
Subject-wise: select subjects at random and assign all of their records to the test set.

If the records of a subject aren't independent of each other, the two approaches can give very different results, and one should almost certainly do the latter: select subjects at random to place in the test set.
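The two splitting strategies can be sketched in plain Python as follows. This is only an illustration under assumed names: the `records` list, its `"user"` and `"x"` keys, and the helper functions are hypothetical and not taken from the question.

```python
import random

def record_wise_split(records, test_fraction=0.25, seed=0):
    """Assign individual records to the test set at random."""
    rng = random.Random(seed)
    shuffled = records[:]
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def subject_wise_split(records, test_fraction=0.25, seed=0):
    """Assign whole subjects (all of their records) to the test set at random."""
    rng = random.Random(seed)
    subjects = sorted({r["user"] for r in records})
    rng.shuffle(subjects)
    n_test = max(1, int(round(len(subjects) * test_fraction)))
    test_subjects = set(subjects[:n_test])
    train = [r for r in records if r["user"] not in test_subjects]
    test = [r for r in records if r["user"] in test_subjects]
    return train, test

def shared_users(train, test):
    """Users that appear on both sides of a split."""
    return {r["user"] for r in train} & {r["user"] for r in test}

# Hypothetical data: 8 users with 5 records each.
records = [{"user": u, "x": u * 10 + i} for u in range(8) for i in range(5)]

train_r, test_r = record_wise_split(records)
train_s, test_s = subject_wise_split(records)

print("users in both sets, record-wise: ", len(shared_users(train_r, test_r)))
print("users in both sets, subject-wise:", len(shared_users(train_s, test_s)))
```

The subject-wise split never shares a user between the two sets, while the record-wise split typically does. If scikit-learn is available, `GroupKFold` and `GroupShuffleSplit` in `sklearn.model_selection` provide this kind of grouped splitting through their `groups` argument.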

What can go wrong with record-wise test set construction?

To take an extreme example, imagine that all the records for each subject were exactly the same and that each subject has numerous records. Then with record-wise validation, you'd effectively be training on the test set! If your algorithm overfits the data, you'd get amazing performance on the test set but horrible performance when you actually see new, independent data.

Training and testing on the same set of users can give horribly misleading results that will not predict out-of-sample performance on new users.

As another example, here's a recent paper that discusses how record-wise cross-validation can go totally wrong in the clinical context: http://biorxiv.org/content/early/2016/06/19/059774.full.pdf+html
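A small simulation of this extreme case, as a sketch only: the data is made up, and the "memorizer" is a hypothetical stand-in for a badly overfit algorithm (a lookup table with a majority-class fallback for unseen inputs). Features are noise and labels alternate by subject ID, so nothing learned from one subject transfers to another.

```python
import random
from collections import Counter

rng = random.Random(0)

# Hypothetical data: 40 subjects, each with 10 identical records.
records = []
for user in range(40):
    features = tuple(rng.random() for _ in range(3))
    records.extend({"user": user, "x": features, "y": user % 2} for _ in range(10))

def fit_memorizer(train):
    """Maximally overfit 'model': a lookup table plus a majority-class fallback."""
    table = {r["x"]: r["y"] for r in train}
    fallback = Counter(r["y"] for r in train).most_common(1)[0][0]
    return table, fallback

def accuracy(model, test):
    """Fraction of test records whose predicted label matches the true label."""
    table, fallback = model
    hits = sum(table.get(r["x"], fallback) == r["y"] for r in test)
    return hits / len(test)

# Record-wise: shuffle records and hold out 25% of them.
shuffled = records[:]
rng.shuffle(shuffled)
train_r, test_r = shuffled[100:], shuffled[:100]

# Subject-wise: shuffle subjects and hold out all records of 25% of them.
users = list(range(40))
rng.shuffle(users)
held_out = set(users[:10])
train_s = [r for r in records if r["user"] not in held_out]
test_s = [r for r in records if r["user"] in held_out]

acc_record = accuracy(fit_memorizer(train_r), test_r)
acc_subject = accuracy(fit_memorizer(train_s), test_s)
print(f"record-wise accuracy:  {acc_record:.2f}")
print(f"subject-wise accuracy: {acc_subject:.2f}")
```

In this construction the record-wise score should come out at or near 1.0, because almost every test record has an identical copy in the training set. The subject-wise score cannot exceed 0.5, since the held-out subjects are unseen and the fallback is all the model has. The second number is the honest estimate of performance on new users.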
