Solved – Naive-Bayes classifier for unequal groups

machine learning, naive bayes, unbalanced-classes

I'm using a naive Bayes classifier to classify between two groups of data.
One group is much larger than the other (more than 4 times as large).
I'm using the prior probability of each group in the classifier.

The problem is that I get a 0% true positive rate and a 0% false positive rate.
I got the same results when I set the priors to 0.5 and 0.5.

How can I set my threshold to something better so that I get more balanced results?

I had a similar problem when using a logistic regression (LR) classifier; I solved it by subtracting the prior term from the bias.
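For reference, here is a minimal sketch of the kind of intercept correction I mean, assuming scikit-learn's LogisticRegression and synthetic data with a made-up 4:1 imbalance rather than my actual data:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic two-class data with roughly a 4:1 imbalance (illustrative only).
X, y = make_classification(n_samples=2000, n_features=10,
                           weights=[0.8, 0.2], random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X, y)

# Under a logistic model the intercept absorbs the log prior odds
# log(P(y=1) / P(y=0)); subtracting that term scores patterns as if
# the two classes had equal priors.
prior_pos = y.mean()
clf.intercept_ -= np.log(prior_pos / (1.0 - prior_pos))

print("positive rate after correction:", (clf.predict(X) == 1).mean())
```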

When I use a Fisher linear discriminant (FLD) on this data, I get good results with the threshold set in the middle.

I assume there is some common solution to this problem; I just couldn't find it.

UPDATE: I've just noticed that the classifier is overfitting: the performance on the training set is perfect (100% correct).

If I use equal groups, the classifier starts assigning patterns to the "small" group as well, but the performance is pretty bad (worse than FLD or LR).

UPDATE 2: I think the problem was that I was using a full covariance matrix. Running with a diagonal covariance matrix gave me more "balanced" results.
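To clarify what I mean in scikit-learn terms (again a sketch on made-up data, not my real setup): per-class, per-feature variances correspond to GaussianNB, while a full per-class covariance matrix is essentially quadratic discriminant analysis rather than naive Bayes:

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

# Imbalanced toy data (illustrative only, not my real data).
X, y = make_classification(n_samples=2000, n_features=10,
                           weights=[0.8, 0.2], random_state=0)

models = {
    "diagonal covariance (GaussianNB)": GaussianNB(),
    "full covariance (QDA)": QuadraticDiscriminantAnalysis(),
}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="balanced_accuracy")
    print(name, round(scores.mean(), 3))
```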

Best Answer

Assigning all patterns to the negative class is certainly not a "weird" result. It could be that the Bayes optimal classifier always classifies all patterns as belonging to the majority class, in which case your classifier is doing exactly what it should do. If the (prior-weighted) density of patterns belonging to the positive class never exceeds that of the negative class, then the negative class is more likely regardless of the attribute values.
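To make that concrete: with a prior of roughly 4:1 in favour of the negative class, as in your data, the Bayes rule only assigns a pattern to the positive class when the likelihood ratio overcomes the prior odds,

$$\text{predict } + \iff P(+)\,p(x \mid +) > P(-)\,p(x \mid -) \iff \frac{p(x \mid +)}{p(x \mid -)} > \frac{P(-)}{P(+)} \approx 4,$$

so unless the class-conditional densities differ by at least that factor somewhere, the minimum-error decision is the negative class everywhere.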

The thing to do in such circumstances is to consider the relative importance of false-positive and false-negative errors; in practice it is rare that the costs of the two types of error are the same. So determine the losses for false-positive and false-negative errors and take them into account when setting the threshold probability (differing misclassification costs are equivalent to changing the prior probabilities, so this is easy to implement for naive Bayes). I would recommend tuning the priors to minimise the cross-validation estimate of the loss (incorporating your unequal misclassification costs), as sketched below.
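As a rough sketch of what I mean (assuming scikit-learn's GaussianNB, with purely illustrative data and costs), you can either threshold the posterior at cost_fp / (cost_fp + cost_fn), or fold the costs into the priors and keep the usual most-probable-class rule; the two views are equivalent:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.naive_bayes import GaussianNB

# Imbalanced toy data and made-up costs (illustrative only).
X, y = make_classification(n_samples=2000, n_features=10,
                           weights=[0.8, 0.2], random_state=0)
cost_fp, cost_fn = 1.0, 5.0   # say a missed positive is 5x worse than a false alarm

nb = GaussianNB().fit(X, y)
p_pos = nb.predict_proba(X)[:, 1]

# Minimum expected cost: predict positive when P(+|x)*cost_fn > P(-|x)*cost_fp,
# i.e. when P(+|x) exceeds cost_fp / (cost_fp + cost_fn).
threshold = cost_fp / (cost_fp + cost_fn)
y_hat_threshold = (p_pos > threshold).astype(int)

# Equivalent view: scale the class priors by the misclassification costs
# and keep the ordinary "most probable class" decision rule.
base_priors = np.bincount(y) / len(y)              # [P(-), P(+)]
scaled = base_priors * np.array([cost_fp, cost_fn])
nb_cost = GaussianNB(priors=scaled / scaled.sum()).fit(X, y)
y_hat_priors = nb_cost.predict(X)

print("positives (threshold rule):", y_hat_threshold.sum())
print("positives (scaled priors): ", y_hat_priors.sum())
```

Tuning the priors by cross-validation then just means treating that prior vector as a hyper-parameter and choosing it to minimise the estimated expected loss.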

If your misclassification costs are equal and your training-set priors are representative of operational conditions, then, assuming your implementation is correct, it is possible that you already have the best NB classifier.