Solved – dealing with imbalanced data set in multiclass text classification

multi-class, unbalanced-classes

I need to build a text classification model.

I have a labeled training set, and my goal is to classify new, unlabeled text.

My training set is composed of 6 categories that are imbalanced.

The categories are distributed as follows:
Category 1 -> 450 examples
Category 2 -> 400 examples
Category 3 -> 250 examples
Category 4 -> 150 examples
Category 5 -> 100 examples
Category 6 -> 50 examples

How should I deal with such an imbalanced multiclass text classification problem?

Best Answer

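Generally, you should consider the following:

  • Resample the training data (oversample the minority classes or undersample the majority classes).
  • Adjust your performance metrics (for example, use F1 rather than accuracy alone).
  • Choose a cost-sensitive algorithm, for example one that assigns larger weights to the minority classes.
  • Try algorithms such as decision trees and boosting, which often cope better with imbalanced data sets.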
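
To make the points above concrete, here is a minimal sketch in plain Python, using the class counts from the question. The function names (balanced_class_weights, random_oversample, macro_f1) are made up for illustration and are not from any library. The weight formula, n_samples / (n_classes * class_count), is the same heuristic scikit-learn uses for class_weight="balanced". The oversampler simply duplicates minority examples at random, which is only one of several possible sampling strategies.

```python
import random

# Class counts taken from the question.
COUNTS = {1: 450, 2: 400, 3: 250, 4: 150, 5: 100, 6: 50}


def balanced_class_weights(counts):
    """Weight each class by n_samples / (n_classes * class_count)."""
    total = sum(counts.values())
    n_classes = len(counts)
    return {label: total / (n_classes * n) for label, n in counts.items()}


def random_oversample(texts, labels, seed=0):
    """Duplicate randomly chosen examples of the smaller classes until
    every class has as many examples as the largest class."""
    rng = random.Random(seed)
    by_class = {}
    for text, label in zip(texts, labels):
        by_class.setdefault(label, []).append(text)
    target = max(len(items) for items in by_class.values())
    out_texts, out_labels = [], []
    for label, items in by_class.items():
        extra = [rng.choice(items) for _ in range(target - len(items))]
        out_texts.extend(items + extra)
        out_labels.extend([label] * target)
    return out_texts, out_labels


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1, so rare classes count as much
    as frequent ones."""
    scores = []
    for label in sorted(set(y_true)):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Rarer classes get larger weights (Category 6 gets the largest).
weights = balanced_class_weights(COUNTS)
print({label: round(w, 2) for label, w in weights.items()})

# A model that always predicts Category 1 is compared on both metrics;
# macro F1 penalizes it for ignoring the other five classes.
y_true = [label for label, n in COUNTS.items() for _ in range(n)]
y_pred = [1] * len(y_true)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print("accuracy:", round(accuracy, 3))
print("macro F1:", round(macro_f1(y_true, y_pred), 3))
```

If you oversample, do it on the training split only, after you have set aside the validation and test data. Otherwise copies of the same document can end up on both sides of the split and inflate your scores.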