Solved – Equal number of training instances of each classification label

classification, machine learning, naive bayes

I am using Naive Bayes to perform binary classification. In my training set, the two class labels occur with probability Pr(label A) = 0.95 and Pr(label B) = 0.05.

Should I prune the training set so that there is an equal number of training instances of each label? Does the answer apply to any classifier, not just Naive Bayes?

Best Answer

It really depends on what your ultimate goal is. If you only care about overall accuracy, and the class priors you observe in your training set are a good estimate of what you are likely to see in the world, then you should not do anything to your data. It is worth noting that you will likely end up with a classifier that overwhelmingly predicts label $A$, but this is exactly what you would expect and makes sense from a decision-theoretic point of view.
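To make that concrete, here is a minimal sketch, assuming scikit-learn's GaussianNB and synthetic two-dimensional data (neither of which comes from the question): a Naive Bayes model trained on a 95/5 split predicts the majority class for most inputs, yet that behaviour still maximizes overall accuracy when the test data follow the same priors.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)

# Synthetic, slightly overlapping classes with priors 0.95 / 0.05.
n_a, n_b = 9500, 500
X = np.vstack([
    rng.normal(loc=0.0, scale=1.0, size=(n_a, 2)),   # label A
    rng.normal(loc=1.0, scale=1.0, size=(n_b, 2)),   # label B
])
y = np.array([0] * n_a + [1] * n_b)

clf = GaussianNB().fit(X, y)   # class priors are estimated from the data
pred = clf.predict(X)

print("fraction predicted as A:", np.mean(pred == 0))
print("accuracy on data with matching priors:", accuracy_score(y, pred))
```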

On the other hand, if you care about things like precision and recall for both classes, or you believe the true class priors are not as skewed as those observed in your training set, you will need to do something to deal with the class imbalance. Rather than repeat them here, I'll point you to this answer I previously posted regarding methods to deal with class imbalance; a couple of common options are sketched below.
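For illustration only, here is a hedged sketch of two such options, again assuming scikit-learn and synthetic data of the same shape as above: overriding the empirical class priors, and naively oversampling the minority class. The answer does not prescribe either technique in particular.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (9500, 2)),   # label A (95%)
               rng.normal(1.0, 1.0, (500, 2))])   # label B (5%)
y = np.array([0] * 9500 + [1] * 500)

# Option 1: override the empirical 0.95 / 0.05 priors with equal priors.
clf_equal_priors = GaussianNB(priors=[0.5, 0.5]).fit(X, y)

# Option 2: oversample the minority class (label 1) until the counts match.
minority = np.flatnonzero(y == 1)
extra = rng.choice(minority, size=(y == 0).sum() - minority.size, replace=True)
clf_oversampled = GaussianNB().fit(np.vstack([X, X[extra]]),
                                   np.concatenate([y, y[extra]]))
```

Both classifiers will assign more probability mass to label $B$ than the untouched model, at the cost of more false positives on label $A$; which trade-off is right depends on the goal discussed above.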

As to the last part of your question, the answer is yes, this answer applies to classifiers in general and not just Naive Bayes.
