Solved – dealing with imbalanced data set in multiclass text classification

multi-class, unbalanced-classes

I need to build a text classification model.

I have a labeled training set, and my goal is to classify new, unlabeled text.

My training set is composed of 6 categories that are imbalanced.

The categories are distributed as follows:
Category 1 -> 450 examples
Category 2 -> 400 examples
Category 3 -> 250 examples
Category 4 -> 150 examples
Category 5 -> 100 examples
Category 6 -> 50 examples

How should I deal with such an imbalanced multi-class text classification problem?

Best Answer

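Generally, you should try the following:

Resample the training set (oversample the minority classes or undersample the majority classes)
Adjust your performance metrics (for example, F1 rather than just accuracy)
Choose a cost-sensitive algorithm, for example one that assigns larger weights to the minority classes
Try algorithms such as decision trees and boosting, which tend to be better adapted to imbalanced data sets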
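As a rough illustration of the second and third points, here is a minimal sketch in plain Python that uses the class sizes from the question. The weighting rule (n_samples / (n_classes * class_count)) is one common "balanced" heuristic and an assumption of this sketch, not something the answer prescribes; the function names are made up for the example.

```python
# Class sizes taken from the question.
counts = {1: 450, 2: 400, 3: 250, 4: 150, 5: 100, 6: 50}


def balanced_weights(class_counts):
    """Weight each class by n_samples / (n_classes * class_count)."""
    n_samples = sum(class_counts.values())
    n_classes = len(class_counts)
    return {c: n_samples / (n_classes * k) for c, k in class_counts.items()}


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores (classes taken from y_true)."""
    scores = []
    for c in sorted(set(y_true)):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


weights = balanced_weights(counts)

# A baseline that ignores the text and always predicts the largest class.
y_true = [c for c, k in counts.items() for _ in range(k)]
y_pred = [1] * len(y_true)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
baseline_macro_f1 = macro_f1(y_true, y_pred)

print(weights)
print(f"accuracy={accuracy:.3f}  macro-F1={baseline_macro_f1:.3f}")
```

The majority-class baseline gets roughly a third of the examples right while scoring close to zero on macro-F1, which is why accuracy alone can be misleading here.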

Related Solutions

Solved – Vowpal Wabbit: best strategy for short text data like titles & keywords

Here are some tips for enhancing the performance of VW models:

Shuffle the data prior to training. Having a non-random ordering of your dataset can really mess VW up.
You're already using multiple passes, which is good. Try also decaying the learning rate between passes, with --decay_learning_rate=.95.
Play around with the learning rate. I've had cases where --learning_rate=10 was great and other cases where --learning_rate=0.001 was great.
Try --oaa 16 or --log_multi 16 rather than --ect 16. I usually find ect to be less accurate. However, oaa is pretty slow. I've found --log_multi to be a good compromise between speed and accuracy. On 10,000 training examples, --oaa 16 should be fine.
Play with the loss function. --loss_function=hinge can sometimes yield large improvements in classification models.
Play with the --l1 and --l2 parameters, which regularize your model. --l2 in particular is useful with text data. Try something like --l2=1e-6.
For text data, try --ngram=2 and --skips=2 to add n-gram and skip grams to your models. This can help a lot.
Try --autolink=2 or --autolink=3 to fit a quadratic or cubic spline model.
Try ftrl optimization with --ftrl. This can be useful with text data or datasets with some extremely rare and some extremely common features.
Try some learning reductions:
1. Try a shallow neural network with --nn=1 or --nn=10.
2. Try a radial kernel svm with --ksvm --kernel=rbf --bandwidth=1. (This can be very slow).
3. Try a polynomial kernel svm with --ksvm --kernel=poly --degree=3. (This can be very slow).
4. Try a gbm with --boosting=25. This can be a little slow.
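The first tip (shuffling) can be handled while converting the data to VW's text input format. The following is a minimal sketch, assuming the simple multiclass layout `<label> | <tokens>`; the helper names and the toy titles are hypothetical, and the exact input format should be checked against the VW documentation.

```python
import random
import re


def to_vw_line(label, text):
    """Format one example as '<label> | token token ...'."""
    # ':' and '|' are special characters in VW's input format, so keep only
    # lowercase alphanumeric tokens.
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return f"{label} | {' '.join(tokens)}"


def shuffled_vw_lines(examples, seed=0):
    """Return VW-formatted lines in random order.

    examples: list of (label, text) pairs.
    """
    rng = random.Random(seed)
    shuffled = list(examples)  # copy, so the caller's list is left untouched
    rng.shuffle(shuffled)
    return [to_vw_line(label, text) for label, text in shuffled]


# Hypothetical toy data, sorted by label (the ordering the tip warns about).
examples = [
    (1, "Cheap flights to Rome"),
    (1, "Hotel deals: book now"),
    (2, "Python list comprehension"),
    (2, "Rust borrow checker | explained"),
]
lines = shuffled_vw_lines(examples)
for line in lines:
    print(line)
```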

VW is extremely flexible, so it often takes a lot of fine tuning to get a good model on a given dataset. You can get a lot more tuning ideas here: https://github.com/JohnLangford/vowpal_wabbit/wiki/Command-line-arguments

Regarding the post you linked to: that person used vw with squared loss on an unbalanced classification problem. That's a silly thing to do, and pretty much guarantees that any linear model will always predict the dominant class. If you're worried about class balance, VW supports weights, so you can over-weight the rarer classes.
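Over-weighting the rarer classes could look roughly like the sketch below. It assumes that an importance weight can be placed right after the label in a VW input line (`<label> <weight> | <features>`); the class counts, the inverse-frequency rule, and the helper names are all hypothetical choices for illustration.

```python
def importance_weights(class_counts):
    """Inverse-frequency weights, scaled so the largest class has weight 1."""
    largest = max(class_counts.values())
    return {label: largest / count for label, count in class_counts.items()}


def weighted_vw_line(label, weight, features):
    """Format '<label> <weight> | <features>' (assumed VW layout)."""
    return f"{label} {weight:g} | {features}"


# Hypothetical class counts for a 3-class problem.
counts = {1: 900, 2: 90, 3: 10}
weights = importance_weights(counts)

line = weighted_vw_line(3, weights[3], "rare class example")
print(line)
```

With these counts, an example from class 3 counts as much as 90 examples from class 1, so the total weight per class is equal.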

Edit: You have 100 classes and 10,000 training examples? That's an average of 100 observations per class, which isn't that many to learn from, no matter what model you use.

Solved – Imbalanced multiclass classification with many classes

There is no single answer to your question, because it really depends on what you are trying to achieve: is your goal a very high classification accuracy, or is it rather data exploration?

If you are purely interested in the classification, you should ask yourself the following questions:

Do I expect the same class priors for new samples? If yes, any over- or under-sampling will lead to a bad model by definition, since you essentially train the model on a different distribution.
What are the consequences of misclassifying a sample? In many cases, the cost of misclassifying a sample is not the same for all classes, e.g. falsely assigning a sample to the 'bad document' class might have less severe consequences than falsely assigning it to one of the other classes.
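If the misclassification costs differ, one way to use them is to predict the class with the lowest expected cost instead of the most probable class. The sketch below assumes the model outputs class probabilities; the cost matrix and the probabilities are invented for illustration, with class 0 standing in for the 'bad document' class.

```python
def min_expected_cost_class(probs, cost):
    """Pick the prediction with the lowest expected cost.

    probs[t]   : estimated probability that the true class is t
    cost[t][p] : cost of predicting p when the true class is t
    Returns (chosen class index, list of expected costs per prediction).
    """
    n = len(probs)
    expected = [sum(probs[t] * cost[t][p] for t in range(n)) for p in range(n)]
    best = min(range(n), key=expected.__getitem__)
    return best, expected


# Hypothetical costs: wrongly predicting class 0 costs 1,
# wrongly predicting class 1 or 2 costs 5, correct predictions cost 0.
cost = [
    [0, 5, 5],
    [1, 0, 5],
    [1, 5, 0],
]
probs = [0.2, 0.45, 0.35]  # hypothetical model output for one sample

choice, expected = min_expected_cost_class(probs, cost)
print(choice, expected)
```

Here class 1 is the most probable, yet class 0 is chosen because a wrong assignment to it is cheap under the assumed costs.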

Generally, a model will always try to minimize the loss, and it doesn't care how this is achieved. In a balanced context, this is done solely by learning the correlation between predictors and the response. In cases of class imbalance, however, the model will also learn the prior distribution, which is independent of the predictors. This is not a misbehavior of the model if the actual distribution has these priors! (In this context I want to link a very good answer by Stephan Kolassa about the general issues when evaluating models based on accuracy.)

If you are less interested in the actual classification and more in questions such as 'what are the main predictors for the response?', 'do predictors interact?' or 'how big is the deterministic component / the learnability of this problem?', it can make sense to balance the classes so that the model doesn't learn the priors but rather the associations between predictors and response, since those could be masked by class imbalance, especially if you deal with sparse data. However, keep in mind that the resulting model is unfit for classifying data that follows the original distribution.
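For such exploratory use, balancing can be as simple as random undersampling. This is a minimal sketch that downsamples every class to the size of the smallest one; the class counts and document IDs are hypothetical, and undersampling is only one of several ways to balance a data set.

```python
import random
from collections import Counter, defaultdict


def undersample(examples, seed=0):
    """Downsample every class to the size of the smallest class.

    examples: list of (label, item) pairs.
    """
    by_label = defaultdict(list)
    for label, item in examples:
        by_label[label].append(item)
    target = min(len(items) for items in by_label.values())

    rng = random.Random(seed)
    balanced = []
    for label, items in by_label.items():
        # Sample without replacement within each class.
        balanced.extend((label, item) for item in rng.sample(items, target))
    rng.shuffle(balanced)
    return balanced


# Hypothetical imbalanced data set.
counts = {"a": 450, "b": 150, "c": 50}
examples = [(label, f"doc-{label}-{i}") for label, n in counts.items() for i in range(n)]

balanced = undersample(examples)
balanced_counts = Counter(label for label, _ in balanced)
print(balanced_counts)
```

As the answer notes, a model fitted on the balanced sample no longer reflects the original class priors.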
