Should models built using under-sampled data be evaluated against the population?

cross-validation, down-sample, machine learning, under sampling, validation

  • I have a dataset of 11 mil. rows with a 1:10 ratio between minority and majority classes.

  • To train a model, I have selected all the minority class members and 1/3 of the majority class.

  • The ratio is now 3:10 and the sample comprises 4.33 mil rows.

  • I have fit an XGBoost model on this under-sampled data with cross-validation and got 'ok' results on the train, test, and validation sets (all derived from the 4.33 mil rows); a minimal sketch of these steps follows this list.
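For concreteness, here is a minimal sketch of the sampling and fitting steps above. It assumes a pandas DataFrame named `df` with a binary label column `"target"` (both names are hypothetical, 1 = minority class) and uses scikit-learn's `cross_val_score` for the cross-validation:

```python
import pandas as pd
import xgboost as xgb
from sklearn.model_selection import cross_val_score

# Hypothetical names: DataFrame `df`, binary label column "target".
minority = df[df["target"] == 1]                  # all ~1 mil minority rows
majority = df[df["target"] == 0].sample(          # ~3.33 mil of the 10 mil
    frac=1 / 3, random_state=42)                  # majority rows
sample = pd.concat([minority, majority]).sample(  # shuffle so downstream
    frac=1, random_state=42)                      # splits are not ordered by class

X, y = sample.drop(columns=["target"]), sample["target"]

model = xgb.XGBClassifier(n_estimators=300, eval_metric="auc")
scores = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
print("CV ROC AUC on the 4.33 mil sample:", scores.mean())
model.fit(X, y)  # final fit on the whole under-sampled sample
```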

My question now is: should I also evaluate the model against the full 11 mil rows, or can I proceed with the model I have now?

Best Answer

After consulting some Data Scientists and a bit of googling, it appears that there is no single standard, as @ReneBt commented.

However, it is recommended to run the model against the full labelled dataset and measure the performance loss. Some loss is expected, since the under-sampled data carries less information than its superset.
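As a hedged sketch of that check, the fitted model can be scored against all 11 mil labelled rows and the metrics compared with those from the 4.33 mil sample. It reuses the hypothetical `df`, `"target"`, and `model` names from the earlier sketch:

```python
from sklearn.metrics import average_precision_score, roc_auc_score

X_full, y_full = df.drop(columns=["target"]), df["target"]
proba = model.predict_proba(X_full)[:, 1]

# Note: the full data includes rows the model was trained on, so these
# figures are optimistic; a stricter check would hold those rows out.
print("ROC AUC on full data:", roc_auc_score(y_full, proba))
# Average precision is more informative at the original 1:10 class ratio.
print("Avg precision on full data:", average_precision_score(y_full, proba))
```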

Whether that loss is acceptable depends on many non-technical factors (again, well pointed out by @ReneBt).

A good reference: https://machinelearningmastery.com/train-final-machine-learning-model/, which answers common questions about "finalizing" a model.
