Solved – Naive Bayes: Imbalanced Dataset in Real-time Scenario

classification, machine learning, naive bayes, scikit learn, text mining

I am using scikit-learn's Multinomial Naive Bayes classifier for binary text classification (the classifier tells me whether a document belongs to category X or not). I train my model on a balanced dataset and test it on a balanced test set, and the results are very promising.

This classifier needs to run in real time and constantly analyze documents thrown at it randomly.

However, when I run my classifier in production, the number of false positives is very high, and I therefore end up with very low precision. The reason is simple: the classifier encounters many more negative samples in the real-time scenario (around 90% of the time), which does not correspond to the ideal balanced dataset I used for training and testing.

Is there a way I can simulate this real-time case during training, or are there any tricks I can use (including pre-processing of the documents to see if they are suitable for the classifier)?

I was planning to train my classifier on an imbalanced dataset with the same proportions as in the real-time case, but I am afraid that might bias Naive Bayes towards the negative class and cost me the recall I currently have on the positive class.

Any advice is appreciated.

Best Answer

To create a good model, it has to be built on training data that has the same "structure" as the data the model will later be applied to. This is the one boring assumption which underlies all classification models.

So by using a balanced dataset while the real world is not balanced, you have already introduced a bias. There are cases where this is not a problem (imagine perfectly separable (non-linear) classes: a model built on a balanced dataset containing all border-relevant points will still work perfectly on a skewed sample), but classifying documents is often a game of probabilities, and hence class skew is more problematic.
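You can see how much skew matters with a little arithmetic. The sensitivity/specificity numbers below are made up for illustration; the point is that the *same* classifier that looks great on a 50/50 test set can have its precision collapse at a 10% positive rate:

```python
# Illustration (numbers are hypothetical): expected precision of a fixed
# classifier as the positive-class prevalence changes.
def precision_at_prevalence(sensitivity, specificity, prevalence):
    """Expected precision = TP / (TP + FP) for a given positive-class rate."""
    tp = sensitivity * prevalence              # true-positive rate per document
    fp = (1 - specificity) * (1 - prevalence)  # false-positive rate per document
    return tp / (tp + fp)

# A classifier with 90% sensitivity and 90% specificity:
balanced = precision_at_prevalence(0.90, 0.90, 0.50)  # 50/50 test set
skewed = precision_at_prevalence(0.90, 0.90, 0.10)    # 10% positives in production

print(round(balanced, 3))  # 0.9
print(round(skewed, 3))    # 0.5
```

With 90% specificity and nine times as many negatives, the false positives alone match the true positives, so precision drops from 0.9 to 0.5 without the classifier changing at all.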

My suggestions:

  • Build the model on an imbalanced set with the same proportions as in production. If you have to sample to achieve this, then perform multiple runs across different samples during validation to improve generalization power.
  • The "bias" towards the negative class in an imbalanced set originates from the-best-guess-is-majority-class-if-everything-else-is-equal, something which Naive Bayes is sensitive to (especially when a lot of (irrelevant) features are involved). Use a different classifier which can capture inter-feature/word-dependencies to reduce this. I'd try Gradient Boosting with trees as described in chapter 10 "Boosting and Additive Trees" of The elements of statistical learning.
  • You are currently using plain precision/recall as your metric. Based on your production requirements, estimate whether a false positive is as bad as a false negative and adjust the metric accordingly.
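On the first suggestion: if re-sampling the training set is impractical, scikit-learn's `MultinomialNB` also lets you keep the balanced training data and just override the estimated priors with the production class proportions via `class_prior`. The corpus and labels below are toy placeholders:

```python
# Sketch (toy corpus, assumed 90/10 production split): train on balanced data
# but supply the production class proportions as fixed priors.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

docs = ["good document about X", "another X text",
        "unrelated text", "off topic document"]  # toy balanced corpus
labels = [0, 0, 1, 1]  # 1 = belongs to category X, balanced as in training

X = CountVectorizer().fit_transform(docs)

# class_prior follows sorted class labels: [P(y=0), P(y=1)].
# Here: 90% negatives, 10% positives, as seen in production.
clf = MultinomialNB(class_prior=[0.9, 0.1]).fit(X, labels)

print(clf.predict_proba(X)[:, 1])  # posteriors now reflect the skewed priors
```

This shifts the decision threshold toward the majority class the same way training on skewed data would, without sacrificing the per-class likelihood estimates learned from the balanced set.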
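On the second suggestion, a minimal sketch of swapping Naive Bayes for gradient-boosted trees in scikit-learn. The corpus and hyperparameters are illustrative only; in practice you would tune them on your own data:

```python
# Sketch: gradient-boosted trees can capture word co-occurrence effects
# that Naive Bayes' independence assumption ignores.
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["good document about X", "another X text",
        "unrelated text", "off topic document"]  # toy corpus
labels = [1, 1, 0, 0]

# Densify only because the toy corpus is tiny; for large vocabularies,
# consider feature selection or hashing first.
X = TfidfVectorizer().fit_transform(docs).toarray()

clf = GradientBoostingClassifier(n_estimators=50, max_depth=2).fit(X, labels)
print(clf.predict(X))
```

Note that tree ensembles on raw bag-of-words matrices can be slow with very large vocabularies, so some dimensionality reduction usually pays off.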
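On the third suggestion: one standard way to encode "a false positive is worse than a false negative" (or vice versa) is the F-beta score, where beta < 1 weights precision more heavily and beta > 1 weights recall. The labels below are a toy example:

```python
# Sketch: F-beta lets you tune the precision/recall trade-off in one metric.
from sklearn.metrics import fbeta_score

y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 1, 0, 0, 0]  # 2 TP, 2 FP, 1 FN -> precision 0.5, recall 2/3

print(fbeta_score(y_true, y_pred, beta=0.5))  # emphasizes precision
print(fbeta_score(y_true, y_pred, beta=2.0))  # emphasizes recall
```

Since this prediction has better recall than precision, the beta=2 score comes out higher than the beta=0.5 score; picking the beta that matches your cost ratio makes model selection reflect production requirements.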