Machine Learning – Random Split vs Time-Based Split for Train and Test Data

classification · machine learning · neural networks · predictive-models · train-test-split

I have been working on a binary classification problem using algorithms such as Random Forest, boosting methods, neural networks, and logistic regression. I have data from Jan 2017 to Jan 2022. We wish to train the model on historical (completed) transactions from Jan 2017 to Jan 2020, and use this model to predict the outcome of active transactions (from Feb 2020 to Jan 2022).

I tried the two approaches below for the train/test split:

a) the usual sklearn train_test_split (random)

b) a manual train/test split (time-based) – all records from 2017 to Jan 2020 were train, and all records from Feb 2020 to Jan 2022 were test. I used a dataframe filter to select records based on the date.
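The two approaches can be sketched as follows. This is a minimal illustration with a made-up dataframe (the column names `date`, `amount`, and `label` are hypothetical, not from the question):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical data: a transaction date, a feature, and a binary outcome.
df = pd.DataFrame({
    "date": pd.to_datetime(["2017-03-01", "2019-06-15", "2020-05-10", "2021-11-20"]),
    "amount": [100.0, 250.0, 80.0, 310.0],
    "label": [0, 1, 0, 1],
})

# a) random split: rows from all years can land in either partition
train_a, test_a = train_test_split(df, test_size=0.25, random_state=42)

# b) time-based split: train on everything before Feb 2020, test on the rest
cutoff = pd.Timestamp("2020-02-01")
train_b = df[df["date"] < cutoff]
test_b = df[df["date"] >= cutoff]
```

The key difference is that in (a) the test set is drawn from the same time range as the training set, while in (b) the test set lies entirely in the future relative to the training data.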

However, I found that my performance degraded when I chose the time-based split (option b above), even on the training data, whereas the regular sklearn train_test_split gave better performance.

Why does this happen? Has anyone here encountered this sort of behavior?

How can we avoid this, and what would be the right way to split the data?

Best Answer

Speaking generally, and noting as an aside that data splitting is a bad idea unless you have > 20,000 observations, splitting on time represents a missed opportunity for modeling time trends. Saying that a model doesn't validate in a later time period may just mean that there was a time trend that was ignored during model development. Time can be a very important variable, and one that needs to be modeled as a continuous but nonlinear effect. Rather than splitting on time or place, I like to model the effects of time and place.
