Naive Bayes – Handling Unbalanced Classes in Naive Bayes Classification

classification, naive bayes, natural language

As part of a university project I have to train a Naive Bayes classifier to sort questions and answers into three categories. The task should be easy, since the three classes are very different from each other.

Dataset

The dataset is a mixture of questions and answers from different domains (furniture and a C++ course) in different languages (Italian and English), so at first sight they should be easy to classify.

The only problem with the dataset is that it is heavily unbalanced:
C++ course: 2700 training instances.
furniture (English): 200 training instances.
furniture (Italian): 60 training instances.

Feature Extraction

The features are simply n-gram counts. I don't remove stopwords because I have two different languages to work with.

Using TF-IDF features and stemmed tokens, I obtained worse results.

The NB algorithm

I implemented Naive Bayes myself, but it obtains the same results as the scikit-learn one.
I trained it with per-class priors and smoothing with alpha=0.5.
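For reference, a sketch of what this setup looks like with scikit-learn's `MultinomialNB`, assuming that is the scikit-learn model meant above. The feature matrix is random placeholder data; only the class sizes (2700, 200, 60) come from the dataset description. It shows how strongly the learned priors favour the largest class.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

rng = np.random.default_rng(0)

# Class sizes from the dataset description; the counts in X are placeholders.
class_sizes = np.array([2700, 200, 60])
y = np.repeat([0, 1, 2], class_sizes)
X = rng.poisson(1.0, size=(class_sizes.sum(), 20))

# alpha=0.5 is the smoothing value; fit_prior=True learns per-class priors.
clf = MultinomialNB(alpha=0.5, fit_prior=True)
clf.fit(X, y)

# The learned priors are simply the relative class frequencies.
priors = np.exp(clf.class_log_prior_)
print(priors.round(3))
```

With these sizes the prior for the largest class is above 0.9. Passing `fit_prior=False` (uniform prior) or an explicit `class_prior` removes that head start while still training a single Naive Bayes model.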

The results

The results were, in some sense, good:

              precision    recall  f1-score   support

      0           0.92      1.00      0.96       318
      1           1.00      0.50      0.67        44
      2           1.00      0.33      0.50         6

avg / total       0.93      0.93      0.92       368

The only drawback is that recall on classes 1 and 2 is low, and the reason is simple: the classes are unbalanced.
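The averaged row looks good because it is weighted by support, so the largest class dominates it. A small check on the numbers from the report, comparing the support-weighted recall with the unweighted (macro) average:

```python
# Per-class recall and support, copied from the report above.
recall = [1.00, 0.50, 0.33]
support = [318, 44, 6]

# Support-weighted average, as in the "avg / total" row.
weighted_recall = sum(r * s for r, s in zip(recall, support)) / sum(support)

# Macro average: every class counts equally, whatever its size.
macro_recall = sum(recall) / len(recall)

print(round(weighted_recall, 2))  # 0.93
print(round(macro_recall, 2))     # 0.61
```

The macro average is probably the more useful number to track when the goal is balanced behaviour across the three classes.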

Is there a way to obtain more balanced results, even if the overall scores are lower, under the constraint of using a single model trained with Naive Bayes?

Cheers


Edit:
I need more balanced results because I will use this classifier as part of a question answering task.

Best Answer

"Tackling the Poor Assumptions of Naive Bayes Text Classifiers" suggests some modifications to Naive Bayes in order to correct for biased sample sets.
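One of the modifications from that paper (Rennie et al., 2003), Complement Naive Bayes, is available in scikit-learn as `ComplementNB`. The following is only a sketch on synthetic data: the class sizes mirror the question, but the feature distributions, document length, and random seed are assumptions, so the printed recalls say nothing about the real dataset.

```python
import numpy as np
from sklearn.metrics import recall_score
from sklearn.naive_bayes import ComplementNB, MultinomialNB

rng = np.random.default_rng(0)
n_features = 30

# One hypothetical word distribution per class.
class_probs = rng.dirichlet(np.ones(n_features), size=3)

def make_split(sizes):
    """Draw synthetic count vectors (15 tokens per document) for each class."""
    X = np.vstack([rng.multinomial(15, p, size=n)
                   for n, p in zip(sizes, class_probs)])
    y = np.repeat([0, 1, 2], sizes)
    return X, y

X_train, y_train = make_split([2700, 200, 60])  # training sizes from the question
X_test, y_test = make_split([318, 44, 6])       # supports from the report

# Fit both variants with the same smoothing and compare per-class recall.
results = {}
for model in (MultinomialNB(alpha=0.5), ComplementNB(alpha=0.5)):
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    results[type(model).__name__] = recall_score(
        y_test, pred, average=None, labels=[0, 1, 2]
    )

for name, recalls in results.items():
    print(name, recalls.round(2))
```

Both are single Naive Bayes models, so swapping one for the other stays within the constraint stated in the question. Whether the complement variant actually helps would need to be checked on the real n-gram features.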

Also have a look at this (and similar) CV posts on class imbalance, unbalanced class labels, etc.