How to make predictions using multiclass unbalanced data


I am trying to predict crimes in San Francisco using machine learning algorithms. It is a multi-class classification problem with unbalanced data.

I took a sample of the data covering 2010 to 2015, restricted to 10 crime types (10 classes with varying distributions). I kept the data from 2010 to 2014 for training and 2015 for testing.

Since the data is unbalanced, I undersampled the majority class and oversampled about five minority classes in my training set. I used random forest as my primary algorithm.

I then used the model to predict the test set, which I left unbalanced, and got poor accuracy. I also tried AdaBoost and multinomial logistic regression, but to no avail.

I ran 10-fold stratified cross-validation on the training set and got good accuracy, but that result is of no use: oversampling duplicated the minority-class rows, so copies of the same row can end up in both the training and validation folds.

I also tried log loss and f1_score (weighted, micro, and macro) as performance metrics, but did not get satisfying results.

Question: How can I proceed further? What else can I try?

Best Answer

The following tactics may be useful:

  • Different Algorithms: Decision trees often perform well on imbalanced datasets. The splitting rules, which look at the class variable when the tree is built, can force every class to be addressed. You can try a few popular tree-based algorithms such as C4.5, C5.0, CART, and Random Forest.
  • Generating Synthetic Samples: A simple way to generate synthetic samples is to randomly sample the attributes from instances in the minority class.

    You could sample them empirically within your dataset or you could use a method like Naive Bayes that can sample each attribute independently when run in reverse. You will have more and different data, but the non-linear relationships between the attributes may not be preserved.

    You can also try SMOTE (Synthetic Minority Over-sampling Technique), an oversampling method that, rather than duplicating rows, creates new minority-class samples by interpolating between existing samples and their nearest minority-class neighbors.

  • Different Perspective: There are fields of study dedicated to imbalanced datasets. For example, you might consider anomaly detection or change detection instead of classification.

  • Performance Metrics: Accuracy is a misleading metric when working with an imbalanced dataset. The following measures can give more insight into model performance than accuracy alone: confusion matrix, precision, recall, F-score, Cohen's kappa, and ROC curves.
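
To make the point about metrics concrete, here is a minimal sketch in plain Python (no libraries) that computes per-class precision and recall, macro F1, and Cohen's kappa from label lists. The class labels `theft` and `assault` and all helper names are made up for illustration; on real data, scikit-learn's `f1_score` and `cohen_kappa_score` do the same job.

```python
from collections import Counter


def per_class_report(y_true, y_pred):
    """Return precision, recall and F1 for every class label."""
    labels = sorted(set(y_true) | set(y_pred))
    # Confusion counts keyed by (true_label, predicted_label).
    counts = Counter(zip(y_true, y_pred))
    report = {}
    for c in labels:
        tp = counts[(c, c)]
        fp = sum(counts[(t, c)] for t in labels if t != c)
        fn = sum(counts[(c, p)] for p in labels if p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        report[c] = {"precision": precision, "recall": recall, "f1": f1}
    return report


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1, so rare classes count equally."""
    report = per_class_report(y_true, y_pred)
    return sum(r["f1"] for r in report.values()) / len(report)


def cohen_kappa(y_true, y_pred):
    """Agreement between labels and predictions, corrected for chance."""
    n = len(y_true)
    observed = sum(t == p for t, p in zip(y_true, y_pred)) / n
    true_freq, pred_freq = Counter(y_true), Counter(y_pred)
    expected = sum(true_freq[c] * pred_freq[c] for c in true_freq) / (n * n)
    return (observed - expected) / (1 - expected) if expected != 1 else 0.0


# Hypothetical toy data: a model that always predicts the majority class.
y_true = ["theft"] * 8 + ["assault"] * 2
y_pred = ["theft"] * 10

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print("accuracy:", accuracy)
print("macro F1:", macro_f1(y_true, y_pred))
print("kappa:   ", cohen_kappa(y_true, y_pred))
```

On this toy example the always-majority model reaches 0.8 accuracy, yet its macro F1 is only about 0.44 and its kappa is 0, because it never identifies the minority class.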

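The interpolation idea behind SMOTE, mentioned under Generating Synthetic Samples, can be sketched in a few lines. This is a simplified illustration in plain Python, not the reference algorithm or any library's implementation: it assumes numeric features only and at least two minority samples, and the function name `smote_like` and its parameters are hypothetical. In practice, the `SMOTE` class in the imbalanced-learn package is the usual choice.

```python
import random


def smote_like(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority-class neighbors.

    Simplified sketch: numeric features only, brute-force neighbor search.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority points by squared Euclidean distance.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbor = rng.choice(others[:k])
        # Pick a random position on the segment from base to neighbor.
        gap = rng.random()
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbor)])
    return synthetic


# Hypothetical minority-class rows with two numeric features each.
minority = [[1.0, 2.0], [1.5, 2.5], [2.0, 2.0], [1.2, 3.0]]
new_points = smote_like(minority, n_new=5)
print(new_points)
```

Each synthetic point lies on a line segment between two real minority rows, so it is new data rather than an exact duplicate. Synthetic points should be generated from the training portion only (inside each cross-validation fold), so that the validation data contains nothing derived from training rows.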
