Solved – Balanced datasets in Naive Bayes

Tags: classification, naive-bayes, unbalanced-classes

In a classification model, it is often considered desirable to have the classes evenly represented in the training dataset. Datasets that satisfy this property are called balanced datasets.

However, in a Naive Bayes classification model, the classifier is defined as an optimization problem that maximizes the posterior probability:

argmax_C P(C | F_1, ..., F_n) = argmax_C P(C) Prod_i P(F_i | C)

where the F_i are features and C ranges over the classes (in this equation the naive conditional-independence assumption has already been applied).

But if we build a balanced dataset with evenly represented categories, then the estimate of the prior P(C) is the same for every class C, so we could drop P(C) from the maximization, since it is constant across categories.
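To make this concrete, here is a minimal sketch with made-up likelihood values for two classes: under a uniform prior the prior factor cancels out of the argmax, while a skewed prior can change the decision.

```python
# Hypothetical per-class likelihoods Prod_i P(F_i | C) for one observed
# feature vector (values are made up for illustration).
likelihoods = {
    "spam": [0.8, 0.4],  # P(F_1 | spam), P(F_2 | spam)
    "ham":  [0.4, 0.6],  # P(F_1 | ham),  P(F_2 | ham)
}

def product(xs):
    result = 1.0
    for x in xs:
        result *= x
    return result

def predict(priors):
    # argmax_C P(C) * Prod_i P(F_i | C)
    return max(priors, key=lambda c: priors[c] * product(likelihoods[c]))

uniform = {"spam": 0.5, "ham": 0.5}  # balanced dataset: prior cancels
skewed  = {"spam": 0.1, "ham": 0.9}  # imbalanced dataset: prior matters

print(predict(uniform))  # the likelihood alone decides: "spam"
print(predict(skewed))   # the prior overrules the likelihood: "ham"
```

With the uniform prior, the comparison reduces to 0.32 vs. 0.24 for the likelihood products alone; with the skewed prior, 0.1 × 0.32 = 0.032 loses to 0.9 × 0.24 = 0.216.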

Moreover, by enforcing evenly represented categories we would be distorting the true distribution of the class.

My question is: do we really want to balance the data, or do we want our classification model to capture the fact that some classes are more likely than others (i.e., keep the dataset unbalanced)?

Best Answer

There are two types of classification models: generative models and discriminative models.

Naive Bayes is a generative model, so its training data should come from the same data-generating process that will produce future data. Artificially balancing the classes changes that process, so you should not do it.

On the other hand, if you are training a discriminative model (e.g. logistic regression), then in some cases you might want to balance your data. One common reason is that the minority class is more important, and by balancing you get better performance on that class. Still, manipulating the data is a dangerous practice, and you should be very sure why you are doing it.
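The generative-model point can be illustrated directly: Naive Bayes estimates P(C) from class frequencies in the training set, so balancing the data changes the learned prior and can flip borderline decisions. A small sketch, using made-up counts and a single hypothetical likelihood product per class:

```python
from collections import Counter

def estimate_priors(labels):
    """Estimate P(C) as the relative class frequencies in the training labels."""
    counts = Counter(labels)
    total = len(labels)
    return {c: counts[c] / total for c in counts}

# Sample reflecting the true process: 90% "ham", 10% "spam" (made up).
real_world = ["ham"] * 90 + ["spam"] * 10
# Artificially balanced sample of the same size: 50/50.
balanced = ["ham"] * 50 + ["spam"] * 50

# Hypothetical likelihood products Prod_i P(F_i | C) for one borderline
# example whose features slightly favor "spam".
likelihood = {"ham": 0.24, "spam": 0.32}

def predict(priors):
    # argmax_C P(C) * Prod_i P(F_i | C)
    return max(priors, key=lambda c: priors[c] * likelihood[c])

print(estimate_priors(real_world))          # {'ham': 0.9, 'spam': 0.1}
print(predict(estimate_priors(real_world))) # "ham": the true prior wins
print(predict(estimate_priors(balanced)))   # "spam": balancing flipped it
```

The same borderline example is classified differently depending only on how the training set was sampled, which is exactly why balancing is inappropriate for a generative model meant to match the true process.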