Solved – the advantage of shuffling data in train-test split

assumptions, machine learning, overfitting

I have been taking an online course on data science and was recently introduced to the ideas of overfitting and underfitting, and to the practice of splitting the dataset into a training set (80%–90%) and a testing set (10%–20%). The instructor says we should shuffle our data before the split. What is the reason for this? Isn't the whole idea of splitting into training/testing sets based on the assumption that the observations are independent and identically distributed (i.i.d.) random variables? If so, shuffling the data should make no difference. Did I get that right?

Best Answer

It may depend on where the data came from and how it was exported. It's not uncommon for real-world data to arrive sorted in some manner. For example, it could be sorted by:

  • user id
  • timestamp of the observation
  • outcome of interest

In each of these cases, if you do a test/train split on the data without shuffling, you may end up with different data distributions in your splits. For example, if the data is sorted by user_id, then most users will appear entirely in either the training set or the test set, but not both (a sketch of this effect follows below).
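Here is a minimal sketch of that effect on a synthetic dataset, using pandas and scikit-learn; the user_id and feature columns are invented for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic export sorted by user_id: 100 users with 10 rows each.
df = pd.DataFrame({
    "user_id": np.repeat(np.arange(100), 10),
    "feature": rng.normal(size=1000),
})

# Without shuffling, the split is a contiguous slice, so the last 20%
# of users land exclusively in the test set.
train_ns, test_ns = train_test_split(df, test_size=0.2, shuffle=False)
print(len(set(train_ns.user_id) & set(test_ns.user_id)))  # 0: no user in both splits

# With shuffling (the scikit-learn default), users spread across both splits.
train_sh, test_sh = train_test_split(df, test_size=0.2, shuffle=True, random_state=0)
print(len(set(train_sh.user_id) & set(test_sh.user_id)))  # ~100: nearly every user in both
```

As an aside, if you actually want each user confined to a single split (say, to measure generalization to unseen users), shuffling rows is not the right tool; scikit-learn's GroupShuffleSplit splits by group instead.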

How much this matters probably depends on how you intend to use your model. In many real-world ML applications you train on historical data and make predictions on future, unseen data. In that case, keeping the data sorted by timestamp before creating your splits might actually be desirable, since it matches the way you'll apply the model in the real world (see the sketch below).
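A sketch of that time-ordered alternative, assuming the data carries a timestamp column (the column names here are illustrative):

```python
import pandas as pd

# Hypothetical log-style data; the column names are illustrative.
df = pd.DataFrame({
    "timestamp": pd.date_range("2023-01-01", periods=1000, freq="h"),
    "feature": range(1000),
})

# Train on the past, evaluate on the future: keep time order and
# split at a cutoff instead of shuffling.
df = df.sort_values("timestamp")
cutoff = int(len(df) * 0.8)
train, test = df.iloc[:cutoff], df.iloc[cutoff:]

# Every test observation is strictly later than all training data,
# mirroring how the model will be applied in production.
assert train["timestamp"].max() < test["timestamp"].min()
```

For cross-validation under the same constraint, scikit-learn's TimeSeriesSplit applies this idea across multiple expanding train/test folds.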
