Solved – the role of ‘shuffle’ in train_test_split()

Tags: machine-learning, python, scikit-learn, train-test-split

I'm wondering what shuffle does when I set it to True while splitting a dataset into train and test sets. Can I use shuffle on a dataset that is ordered by date?

train, test = train_test_split(df, test_size=0.2, random_state=42, shuffle=True)

Example dataframe: [image not included]

Best Answer

With time-series data, where you can expect autocorrelation, you should not split the data randomly into train and test sets. Split it on time instead, so that you train on past values to predict future ones. Scikit-learn provides TimeSeriesSplit for this.
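As a rough illustration of splitting on time, here is a minimal sketch using TimeSeriesSplit. The ten-row array is made-up toy data that is assumed to be already sorted by time, and n_splits=3 is an arbitrary choice.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Toy series of 10 observations, assumed to be sorted by time already.
X = np.arange(10).reshape(-1, 1)

tscv = TimeSeriesSplit(n_splits=3)
splits = list(tscv.split(X))

for train_idx, test_idx in splits:
    # In every fold, all training rows come before all test rows.
    print("train:", train_idx, "test:", test_idx)
```

Each fold trains on an expanding window of earlier rows and tests on the rows that follow it, so no future observation leaks into training.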

The shuffle parameter is there to prevent non-random assignment to the train and test sets. With shuffle=True the data is split randomly. For example, say that you have balanced binary classification data that is ordered by label. If you split it 80:20 into train and test without shuffling, the test set would contain labels from only one class. Random shuffling prevents this.
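The label-ordering problem can be reproduced with a small sketch. The data here is invented for the illustration: 50 rows of class 0 followed by 50 rows of class 1.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Balanced binary labels sorted by class: 50 zeros, then 50 ones (toy data).
y = np.array([0] * 50 + [1] * 50)
X = np.arange(100).reshape(-1, 1)

# shuffle=False: the test set is simply the last 20% of the rows.
_, _, _, y_test_ordered = train_test_split(X, y, test_size=0.2, shuffle=False)

# shuffle=True: rows are assigned to train and test at random.
_, _, _, y_test_shuffled = train_test_split(
    X, y, test_size=0.2, shuffle=True, random_state=42
)

print("classes in test set, no shuffle:", np.unique(y_test_ordered))
print("classes in test set, shuffled:  ", np.unique(y_test_shuffled))
```

Without shuffling, the test set consists only of the last 20 rows, which all belong to class 1.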

If random shuffling would break your data, that is a good argument for not splitting randomly into train and test at all. In such cases you would split on time, or use clustered splits (say you have data on education, so you assign whole schools to train or test rather than individual students).
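One possible way to do a clustered split in scikit-learn is GroupShuffleSplit. The answer does not name a specific tool, so treat this as a sketch under that assumption. The school IDs and the 12-student dataset are hypothetical.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Toy data: 12 students in 4 schools (hypothetical school IDs).
schools = np.array(["A", "A", "A", "B", "B", "B", "C", "C", "C", "D", "D", "D"])
X = np.arange(len(schools)).reshape(-1, 1)

# test_size applies to groups here: 25% of 4 schools means 1 school held out.
gss = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
train_idx, test_idx = next(gss.split(X, groups=schools))

train_schools = set(schools[train_idx])
test_schools = set(schools[test_idx])
print("train schools:", sorted(train_schools))
print("test schools: ", sorted(test_schools))
```

The split is still random, but the unit being shuffled is the school, so no school contributes students to both sets.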

When should you use shuffle=False? TL;DR: never. The cases where you might consider it, and why it is still usually not the best choice:

  • Your data was randomly sampled or has already been shuffled. Shuffling one more time won't hurt, though. I remember seeing multiple datasets that were supposed to be randomly shuffled but weren't.
  • Your dataset is huge, so shuffling makes the whole pipeline a little slower. If that is the case, you probably don't want to use scikit-learn pipelines for preprocessing either. If you use something else that scales better, you still need to make sure that it shuffles the data.
  • You don't want to split randomly, and your data is already arranged the way you want to split it. For example, you have data collected during 2010-2020 and you want an 80:20 split with the years 2010-2018 in the train set and 2019-2020 in the test set. Here shuffle=False makes sense, but you would probably prefer to use TimeSeriesSplit or to write the split by hand, for greater control over what you are doing. For example, if you want to split by year, you probably don't want a few days of one year to land by accident in a different set from the rest of that year, so you would rather do the split manually.
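The manual split by year from the last point could look roughly like the following sketch. It assumes a pandas DataFrame with a datetime column; the column names "date" and "value" and the daily toy data are invented for the example.

```python
import pandas as pd

# Toy daily data covering 2010-2020 (assumed column names "date" and "value").
df = pd.DataFrame({"date": pd.date_range("2010-01-01", "2020-12-31", freq="D")})
df["value"] = range(len(df))

# Split on the calendar year rather than on a row fraction,
# so that no single year ends up in both sets.
train = df[df["date"].dt.year <= 2018]
test = df[df["date"].dt.year >= 2019]

print("train:", train["date"].min().date(), "to", train["date"].max().date())
print("test: ", test["date"].min().date(), "to", test["date"].max().date())
```

The resulting proportions are only approximately 80:20, because the cut is placed on the year boundary instead of at an exact row count.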