Solved – Training data is imbalanced – but should the validation set also be imbalanced?

cross-validation, dataset, machine-learning, neural-networks, unbalanced-classes

I have labelled data composed of 10,000 positive examples and 50,000 negative examples, giving a total of 60,000 examples. This data is obviously imbalanced.

Now let us say I want to create my validation set, and I want to use 10% of my data to do so. My question is as follows:

Should I make sure that my validation set is ALSO imbalanced (as a nod to the true distribution of the training set), or should I make sure that my validation set is balanced? For example, should my validation set be made from:

  • 10% of the positive examples and 10% of the negative examples, giving 1,000 positive and 5,000 negative examples (this validation set reflects the original data imbalance); or
  • 10% of the positive examples, giving 1,000 positives, and only 2% of the negatives (10%/5 = 2%), also giving 1,000 negative examples (this validation set is balanced)?
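For concreteness, the counts implied by the two options can be sketched as follows (a minimal illustration of the arithmetic above, not part of any training pipeline):

```python
# Counts from the question: 10,000 positives and 50,000 negatives.
n_pos, n_neg = 10_000, 50_000

# Option 1: take 10% of each class, preserving the 1:5 imbalance.
val_pos_a = round(0.10 * n_pos)  # 1,000 positives
val_neg_a = round(0.10 * n_neg)  # 5,000 negatives

# Option 2: take 10% of positives but only 2% of negatives,
# yielding a balanced validation set of 1,000 each.
val_pos_b = round(0.10 * n_pos)  # 1,000 positives
val_neg_b = round(0.02 * n_neg)  # 1,000 negatives

print(val_pos_a, val_neg_a)  # 1000 5000
print(val_pos_b, val_neg_b)  # 1000 1000
```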

(Same question for the test set).

There seem to be plenty of methods for training with imbalanced data, but nowhere can I find best practices on whether my validation set should ALSO reflect the original imbalance. Finally, I am NOT doing cross-validation; I will be using a single validation set and a neural network.

Thanks!

Best Answer

The point of the validation set is to select the epoch/iteration at which the neural network is most likely to perform best on the test set. Consequently, it is preferable that the class distribution in the validation set reflect the class distribution in the test set, so that performance metrics on the validation set are a good approximation of the performance metrics on the test set. In other words, the validation set should reflect the original data imbalance.
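In practice, a stratified split gives you this behaviour automatically: each split keeps the same class proportions as the full dataset. A minimal sketch using scikit-learn's `train_test_split` with its `stratify` argument, with synthetic labels standing in for the real data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels mimicking the question's data:
# 10,000 positives (1) and 50,000 negatives (0).
y = np.array([1] * 10_000 + [0] * 50_000)
X = np.arange(len(y)).reshape(-1, 1)  # placeholder features

# Hold out 10% for validation; stratify=y preserves the
# 1:5 class ratio in both the training and validation splits.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.10, stratify=y, random_state=0
)

print(int(y_val.sum()), int(len(y_val) - y_val.sum()))  # 1000 5000
```

The same call (or the same held-out fraction) can be used to carve out a stratified test set, so validation and test metrics are computed under the same class distribution.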
