Solved – Test and Training dataset correlation while Splitting the dataset

biostatistics, classification, dataset, train

I want to split my main dataset into two parts: a training dataset and a test dataset.

I once read (unfortunately I can no longer find where) that when splitting the main dataset into training and test sets, I should make sure that two observations of the same origin (correlated data) do not end up on opposite sides of the split, one in the training set and the other in the test set.

My question is: why should I not place two observations of the same origin in both the training and test groups (imagine I used k-fold to split my main dataset)? What problem would it cause? Would it cause classification errors, or some kind of overfitting/underfitting problem?

Best Answer

Suppose you have a dataset of credit card transactions, with binary labels indicating whether each one was fraudulent or genuine. The user ID or card ID of a transaction could be viewed as its "origin" in the context of your question.

Now, suppose you have a single user / card ID which was used in multiple transactions. If it belonged to a fraudster, it's likely they have multiple fraudulent transactions in the dataset (assuming they were not immediately caught and blocked). If you put some of these transactions in the training data, and other transactions of the same user in the test data, those test cases may be unrealistically easy for the trained model to detect, so the test score overestimates how well the model will do on users it has never seen.
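One way to avoid this is to split by origin rather than by row. The following is only a minimal sketch using the Python standard library; the function name `group_train_test_split`, the `card_id` field, and the toy data are all made up for illustration. Note that `test_fraction` here applies to the number of card IDs, not the number of rows.

```python
import random
from collections import defaultdict

def group_train_test_split(rows, group_key, test_fraction=0.25, seed=0):
    """Split rows so that every group lands entirely in train or entirely in test."""
    by_group = defaultdict(list)
    for row in rows:
        by_group[row[group_key]].append(row)

    groups = sorted(by_group)              # sorted first so the shuffle is reproducible
    random.Random(seed).shuffle(groups)

    # The fraction is taken over groups, not over individual rows.
    n_test = max(1, round(len(groups) * test_fraction))
    test_groups = set(groups[:n_test])

    train = [r for g in groups if g not in test_groups for r in by_group[g]]
    test = [r for g in groups if g in test_groups for r in by_group[g]]
    return train, test

# Hypothetical toy data: 8 cards with 3 transactions each.
transactions = [
    {"card_id": f"card_{c}", "amount": 10 * c + t, "fraud": int(c % 4 == 0)}
    for c in range(8) for t in range(3)
]

train, test = group_train_test_split(transactions, "card_id")
train_ids = {r["card_id"] for r in train}
test_ids = {r["card_id"] for r in test}
print(len(train), len(test), train_ids & test_ids)
```

The set printed at the end is the overlap between training and test card IDs, which should be empty for a split of this kind.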

As an extreme case, suppose we include the card ID itself as a feature when training our model. The model can simply memorize that a given card ID was associated with fraudulent transactions in the training data, and have an unrealistically easy time detecting and flagging that card's transactions in the test data. Such a model would perform significantly worse in the future, once all the card IDs it has memorized have been blocked.
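The extreme case can be illustrated with a toy experiment. This is a sketch on synthetic data, not a real fraud model: the "model" does nothing except remember fraudulent card IDs, and the data layout (200 cards, 5 transactions each, every seventh card fraudulent) is an arbitrary assumption chosen to keep the example deterministic.

```python
import random

# Hypothetical synthetic data: 200 cards, 5 transactions each.
# Every 7th card belongs to a fraudster, and all of its transactions
# are labelled fraudulent (a deliberate simplification).
rows = [(card, int(card % 7 == 0)) for card in range(200) for _ in range(5)]

def fit_memorizer(train):
    """'Model' that only remembers which card IDs were fraudulent in training."""
    return {card for card, label in train if label == 1}

def recall_on_fraud(fraud_cards, test):
    """Fraction of fraudulent test transactions that the memorizer flags."""
    fraud_rows = [card for card, label in test if label == 1]
    return sum(card in fraud_cards for card in fraud_rows) / len(fraud_rows)

# (a) Row-level random split: the same card can appear on both sides.
shuffled = rows[:]
random.Random(0).shuffle(shuffled)
cut = int(0.8 * len(shuffled))
row_recall = recall_on_fraud(fit_memorizer(shuffled[:cut]), shuffled[cut:])

# (b) Card-level split: every 5th card is held out with all of its transactions.
train_b = [r for r in rows if r[0] % 5 != 0]
test_b = [r for r in rows if r[0] % 5 == 0]
group_recall = recall_on_fraud(fit_memorizer(train_b), test_b)

print(f"recall, row-level split:  {row_recall:.2f}")
print(f"recall, card-level split: {group_recall:.2f}")
```

With the card-level split, the memorizer cannot flag anything, because none of the held-out cards were seen during training. That is the more honest estimate of how memorization alone would fare on new cards; the row-level split should look far better for the same "model".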

Of course, this is an extreme example; you probably wouldn't want to include the raw card ID as a feature. The issue can still appear in a less extreme form in realistic cases, though. For example, users may be memorized and recognized based on patterns in their behaviour that are encoded in the features.
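Since the question mentions k-fold, the same idea carries over to cross-validation: assign whole groups, not individual rows, to folds. Below is a rough standard-library sketch; the function `group_k_fold`, its greedy balancing rule, and the example group labels are assumptions made for illustration, not a reference implementation.

```python
from collections import defaultdict

def group_k_fold(groups, k):
    """Yield (train_idx, test_idx) pairs in which no group is split across the two."""
    members = defaultdict(list)
    for idx, g in enumerate(groups):
        members[g].append(idx)
    if k < 2 or k > len(members):
        raise ValueError("k must be between 2 and the number of distinct groups")

    # Greedy balancing: place the largest groups first, each into the
    # fold that currently holds the fewest observations.
    folds = [[] for _ in range(k)]
    for g in sorted(members, key=lambda g: (-len(members[g]), str(g))):
        smallest = min(range(k), key=lambda i: len(folds[i]))
        folds[smallest].extend(members[g])

    for i in range(k):
        test_idx = sorted(folds[i])
        train_idx = sorted(j for f in folds[:i] + folds[i + 1:] for j in f)
        yield train_idx, test_idx

# Hypothetical group labels, one per observation (e.g. a card or user ID).
groups = ["a", "a", "b", "c", "c", "c", "d", "e", "e", "f"]
splits = list(group_k_fold(groups, k=3))
for train_idx, test_idx in splits:
    train_groups = {groups[j] for j in train_idx}
    test_groups = {groups[j] for j in test_idx}
    print(sorted(test_groups), "held out;", len(train_idx), "training rows")
```

Libraries often ship a ready-made version of this; scikit-learn, for instance, provides `GroupKFold` in `sklearn.model_selection`, which is probably preferable to hand-rolled code in practice.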