Solved – Which balancing strategy to use when learning from a very imbalanced dataset

deep-learning, neural-networks, regression-strategies, unbalanced-classes, validation

I'm using a deep learning approach on a dataset of ~20 million elements, where each element has a TRUE or FALSE label.
Unfortunately, this dataset is very imbalanced: 98% of the labels are false and only 2% are true.

My algorithm uses three subsets: a training set, a validation set and a test set. All these sets are independent.

My algorithm runs this way:

(1) it trains an artificial neural network model on the training set

(2) it applies the trained model to the validation set, and computes its ROC AUC

(3) if the AUC from step (2) is < 80%, it selects a new model and goes back to (1); otherwise it proceeds to (4)

(4) it applies the trained model to the test set
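The four steps above can be sketched roughly as follows. This is only an illustration, assuming scikit-learn: the synthetic data, candidate architectures, and even the 0.80 threshold stand in for the real setup (balancing of the training/validation sets is omitted here), and `MLPClassifier` is just one possible stand-in for the actual neural network.

```python
# Hedged sketch of the 4-step train/validate/test loop (assumed setup).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import roc_auc_score

# Stand-in data (the real dataset has ~20M elements with 2% trues).
X, y = make_classification(n_samples=4000, weights=[0.9, 0.1], random_state=0)
X_tmp, X_te, y_tmp, y_te = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_tmp, y_tmp, test_size=0.25, stratify=y_tmp, random_state=0)

auc_val, model = 0.0, None
for hidden in [(8,), (32,), (64, 32)]:      # hypothetical candidate models
    cand = MLPClassifier(hidden_layer_sizes=hidden, max_iter=500, random_state=0)
    cand.fit(X_tr, y_tr)                                   # (1) train
    auc_val = roc_auc_score(y_va, cand.predict_proba(X_va)[:, 1])  # (2) validate
    model = cand
    if auc_val >= 0.80:                                    # (3) accept or retry
        break

auc_test = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])    # (4) test
print(f"validation AUC={auc_val:.3f}, test AUC={auc_test:.3f}")
```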

To make my algorithm able to recognize both trues and falses, I use an artificially balanced training set made of 50% trues and 50% falses; the validation set is balanced the same way (50% trues and 50% falses).
In contrast, the test set keeps the natural distribution: 98% falses and 2% trues.
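Constructing such a 50/50 set by undersampling the majority class might look like this. A toy NumPy sketch: the array sizes, the feature matrix, and the 2% positive rate are made-up stand-ins for the real data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for the 98%/2% labels (the real set has ~20M elements).
y = rng.random(100_000) < 0.02           # True with probability ~0.02
X = rng.normal(size=(100_000, 5))        # hypothetical features

pos = np.flatnonzero(y)
neg = np.flatnonzero(~y)
keep = rng.choice(neg, size=len(pos), replace=False)  # undersample the falses
idx = rng.permutation(np.concatenate([pos, keep]))    # shuffle the balanced set

X_bal, y_bal = X[idx], y[idx]
print(y_bal.mean())   # 0.5 exactly: as many trues as falses
```

Note that this throws away most of the falses, which is exactly the practice the answer below objects to.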

Is this approach correct?

Do you think I should change the artificial balance of the training set? And what about the validation set?

Best Answer

Any method (classification method or choice of accuracy measure) that requires deleting data is deficient. Once you develop a well-calibrated probability estimation model, you can judge it with a proper accuracy score. Also, the $c$-index (concordance probability; AUROC), a semi-proper scoring rule, is completely unaffected by extreme imbalance, as is the coefficient of discrimination described in Section 10.6 of my Course Notes.
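To see why the $c$-index is unaffected by prevalence: it is the probability that a randomly chosen true case gets a higher score than a randomly chosen false case, an average over positive/negative pairs, so changing how many negatives there are does not move it. A small illustration with synthetic scores (assuming scikit-learn; the score distributions are made up):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# Hypothetical risk scores: positives tend to score higher than negatives.
y = np.array([1] * 50 + [0] * 50)
s = np.where(y == 1, rng.normal(1.0, 1.0, 100), rng.normal(0.0, 1.0, 100))

# c-index = P(score_pos > score_neg), computed over all pos/neg pairs
# (ties counted as 1/2); this equals the ROC AUC.
pos, neg = s[y == 1], s[y == 0]
pairs = ((pos[:, None] > neg[None, :]).mean()
         + 0.5 * (pos[:, None] == neg[None, :]).mean())
auc = roc_auc_score(y, s)

# Replicating every negative 49x shifts prevalence from 50% to ~2% of
# positives, yet the AUC (an average over pos/neg pairs) does not move.
y_imb = np.concatenate([y, np.zeros(49 * 50, dtype=int)])
s_imb = np.concatenate([s, np.tile(neg, 49)])
auc_imb = roc_auc_score(y_imb, s_imb)
print(auc, auc_imb)
```

A proper score such as the Brier score or log loss, computed on predicted probabilities, would then assess calibration as well as discrimination.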