Rule-Based Label – Random Split vs Time-Based Split

classification, machine-learning, neural-networks, predictive-models, train-test-split

We have a dataset of 977 records (77:23 class ratio) where we try to predict a binary outcome, whether a supplier met the target or not, using random forests and neural networks. However, we didn't have any labels in the first place: although some suppliers have met the target in the past, our system doesn't capture it. So, we came up with a heuristic rule to label who met the target and who didn't. My problem is not a time series problem, but my data does have a timestamp column.

Our data spans from Jan 2017 to Jan 2022.

Approach 1 – Time based split

Records from Jan 2017 to Dec 2020 are already too late for the business to act upon, so we plan to use them as training data, and records from Jan 2021 to Jan 2022 as test data. This is what I call the time-based split (a sketch is shown below).
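A minimal sketch of this split, assuming the data sits in a pandas DataFrame with a datetime column `timestamp` and the rule-based binary column `label` (both names are illustrative):

```python
import pandas as pd

# df: features + a datetime column "timestamp" + the rule-based "label" column
df["timestamp"] = pd.to_datetime(df["timestamp"])

cutoff = pd.Timestamp("2021-01-01")
train = df[df["timestamp"] < cutoff]    # Jan 2017 – Dec 2020
test  = df[df["timestamp"] >= cutoff]   # Jan 2021 – Jan 2022

X_train, y_train = train.drop(columns=["label", "timestamp"]), train["label"]
X_test,  y_test  = test.drop(columns=["label", "timestamp"]),  test["label"]
```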

Approach 2 – random train test split

If I ignore the date/year and just do a random split with sklearn's `train_test_split`, I call this the random train test split. I assume this sort of split can also help the model learn the time-based characteristics, because it trains on data points from all years. I verified the year distribution of X_train (see the sketch below).
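A sketch of that split, stratified on the 77:23 label, with a quick check of the year distribution (column names as above are placeholders):

```python
from sklearn.model_selection import train_test_split

X = df.drop(columns=["label", "timestamp"])
y = df["label"]

# Random split; stratify keeps the 77:23 class ratio in both sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Verify that all years 2017–2022 are represented in the training split.
print(df.loc[X_train.index, "timestamp"].dt.year.value_counts().sort_index())
```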

Now the problem is,

Approach 1 gives much poorer performance: the F1-score drops by about 10 points to 45–50, and the AUC is 76.

Approach 2 gives decent performance (e.g. F1 = 63 for the minority class and AUC = 85). The lift in the top 4 deciles is 2 or more times that of a random model (a sketch of the decile-lift computation is below).
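For reference, one way the per-decile lift could be computed from the test-set scores (the function and column names are illustrative, not from the original post):

```python
import pandas as pd

def decile_lift(y_true, y_score):
    """Lift of each score decile relative to the overall positive rate."""
    scored = pd.DataFrame({"y": y_true, "score": y_score})
    # Rank before qcut so ties don't break the 10 equal-sized bins.
    scored["decile"] = pd.qcut(scored["score"].rank(method="first"), 10, labels=False)
    base_rate = scored["y"].mean()
    lift = scored.groupby("decile")["y"].mean() / base_rate
    return lift.sort_index(ascending=False)   # top-scoring decile first

# Example: decile_lift(y_test, model.predict_proba(X_test)[:, 1])
```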

Should I do random split or time based split?

If I do the random split (approach 2), the test data contains records (from 2018, 2019, 2020, etc.) that are already complete and too late to act upon, so the model is predicting outcomes that no longer matter. There are, however, some 2021 and 2022 records as well for the business to act upon.

If I do the time-based split (approach 1), the model's performance is really poor, but the test data contains more records from 2021 and 2022 for the business to act upon than in approach 2.

Additional question: does it make sense to let the business act upon records from the training data? Our labels are rule-based (the event wasn't captured in real time); we just came up with a rule to treat it as having happened. Now, with the model, we have risk probabilities for the outcome. So, can the business act upon training-data records?

Can I seek your expertise here? What would be the right thing to do?

It looks like my data has some time-based pattern, and that's why performance degrades when I predict 2021 and 2022 from 2017–2020 data. A time-based split doesn't let the model capture that characteristic (hence the low performance). So, should I just stick with the random train test split?

Best Answer

It looks like my data has some time-based pattern ...

This alone requires a time-based split. The train/test split should mimic how the model will be used going forward, e.g. you'd be predicting outcomes in 2023 using data from the previous years.

An AUC of 0.76 might be a fair score depending on the problem, especially if you don't have state-of-the-art results from others on the same dataset. The F1-score depends on where you put your threshold.

What I'd recommend is to evaluate the business cost of wrong classifications, decide whether to use this model at all, and choose the threshold accordingly.
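One way to make that concrete: a sketch that picks the probability threshold minimizing an assumed misclassification cost. The cost values `COST_FP` and `COST_FN` are placeholders the business would supply; nothing here is prescribed by the original answer.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

COST_FP = 1.0   # assumed cost of acting on a supplier that actually met the target
COST_FN = 5.0   # assumed cost of missing a supplier that did not meet the target

def best_threshold(y_true, y_score, thresholds=np.linspace(0.05, 0.95, 19)):
    """Return the threshold with the lowest total expected misclassification cost."""
    costs = []
    for t in thresholds:
        tn, fp, fn, tp = confusion_matrix(y_true, (y_score >= t).astype(int)).ravel()
        costs.append(fp * COST_FP + fn * COST_FN)
    return thresholds[int(np.argmin(costs))]

# Example: best_threshold(y_test, model.predict_proba(X_test)[:, 1])
```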
