Solved – How to train a classifier for unbalanced class distributions

distributions, neural-networks, unbalanced-classes

I trained a ReLU neural network to classify a data set with 3 unbalanced classes (in both the training and test sets): 30% of the samples are in class A, 10% in class B, and 60% in class C. For this problem I mostly care about the precision of class C (with reasonable recall), since that is the only class I can make use of. Currently I artificially clone the class A and B samples, adding random ±5% adjustments to each clone, so that each class makes up roughly one third of the training set, and I then pick the winning epoch based on the F1 score for class C.
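Roughly, the cloning step looks like this (a minimal NumPy sketch of what I described above; the multiplicative ±5% noise model and the function name are just illustrative):

import numpy as np

def oversample_with_jitter(X, y, noise=0.05, seed=0):
    # Clone samples of the smaller classes, each with a random +/-5%
    # multiplicative adjustment, until every class matches the largest one.
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    X_parts, y_parts = [X], [y]
    for cls, count in zip(classes, counts):
        n_extra = target - count
        if n_extra == 0:
            continue
        idx = rng.choice(np.where(y == cls)[0], size=n_extra, replace=True)
        jitter = rng.uniform(1.0 - noise, 1.0 + noise, size=X[idx].shape)
        X_parts.append(X[idx] * jitter)
        y_parts.append(np.full(n_extra, cls))
    return np.concatenate(X_parts), np.concatenate(y_parts)

# X_train_bal, y_train_bal = oversample_with_jitter(X_train, y_train)  # training set only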

NEW BEST: epoch 1, score: 0.572852844535, F1: 0.589895, precision 0.516919, recall 0.686862, accuracy 0.643098 (0.572852844535), learning_rate=1.0 (patience: 320000 / 1599)
F1: 0.589895, precision 0.516919, recall 0.686862, accuracy 0.643098
precisions: [ 0.19046712  0.48642075  0.61648193]
recalls: [ 0.17856346  0.10650572  0.82099259]
class[0] is predicted as class[0]: 40
class[0] is predicted as class[1]: 4
class[0] is predicted as class[2]: 180
class[1] is predicted as class[0]: 54
class[1] is predicted as class[1]: 36
class[1] is predicted as class[2]: 248
class[2] is predicted as class[0]: 116
class[2] is predicted as class[1]: 34
class[2] is predicted as class[2]: 688

NEW BEST epoch 14, score: 0.708267443522, F1: 0.530256, precision 0.612621, recall 0.467413, accuracy 0.556719 (0.708267443522), learning_rate=0.974310040474 (patience: 343195 / 22399)
F1: 0.530256, precision 0.612621, recall 0.467413, accuracy 0.556719
precisions: [ 0.22606464  0.33912306  0.82626222]
recalls: [ 0.49551359  0.46152481  0.44271548]
class[0] is predicted as class[0]: 111
class[0] is predicted as class[1]: 89
class[0] is predicted as class[2]: 24
class[1] is predicted as class[0]: 128
class[1] is predicted as class[1]: 156
class[1] is predicted as class[2]: 54
class[2] is predicted as class[0]: 252
class[2] is predicted as class[1]: 215
class[2] is predicted as class[2]: 371

As seen above, the accuracy at epoch 1 looks much better only because the network classifies almost every test sample as class C; at epoch 14 the accuracy looks worse, but the model is actually better, since it can classify the other classes too.
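For example, recomputing the per-class precision and recall directly from the epoch-1 confusion matrix above makes this collapse explicit (a minimal NumPy sketch; rows are true classes and columns are predictions, as in the printout):

import numpy as np

# Epoch-1 confusion matrix from the log above (rows = true class, columns = predicted class).
cm = np.array([[ 40,   4, 180],
               [ 54,  36, 248],
               [116,  34, 688]])

recall    = np.diag(cm) / cm.sum(axis=1)   # per-class recall
precision = np.diag(cm) / cm.sum(axis=0)   # per-class precision
f1        = 2 * precision * recall / (precision + recall)

print(recall)     # ~[0.18, 0.11, 0.82]: classes A and B are mostly ignored
print(precision)  # ~[0.19, 0.49, 0.62]
print(f1)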

How should I train and evaluate a classifier on this unbalanced data set? Should I also artificially balance the test set, in addition to the training set?

Best Answer

Jain and Nag suggest using a balanced training set for model building and a representative test set for evaluation.

The balanced training set allows the model to familiarize itself with the less frequent class of interest and helps it formulate general rules.

However, as @rep_ho points out, you should definitely use a test set that represents the population of your data; otherwise you would skew your results.

Note, though, that accuracy can be a misleading performance measure on a highly unbalanced dataset. If you have a dataset with two groups in a 90/10 split, a model that simply 'guesses' the majority category all the time still achieves 90% accuracy.
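As a quick illustration of that trap (a made-up 90/10 example using scikit-learn's standard metrics, not anything specific to the question above):

import numpy as np
from sklearn.metrics import accuracy_score, classification_report

# Hypothetical 90/10 split: 900 majority samples, 100 minority samples.
y_true = np.array([0] * 900 + [1] * 100)
y_pred = np.zeros_like(y_true)   # a 'model' that always guesses the majority class

print(accuracy_score(y_true, y_pred))                          # 0.9, despite never finding the minority class
print(classification_report(y_true, y_pred, zero_division=0))  # per-class precision/recall/F1 expose the failure

Per-class precision, recall, and F1 (which you are already logging) are far more informative model-selection criteria here than overall accuracy.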