Solved – Which classifiers work well with unbalanced data

binary data, classification, neural networks, unbalanced-classes

I have a binary classification problem which is very unbalanced – it can have 98% of data from one class. Which classifiers work well with this sort of data?

I have an unlimited supply of training data, since I produce it using a pseudorandom number generator. However, I found that to get a neural network to produce decent results, I had to generate balanced (50:50) data, which is equivalent to over-sampling. The problem with this approach is that the training data is then no longer representative of real life.

Best Answer

Some options:

  • Do not use accuracy alone as a metric. A model that classifies everything as the majority class would score 98% accuracy, which tells you nothing. Precision and recall are better choices.
  • Try a cost-sensitive classifier, which lets you specify the cost of misclassification for each class.
  • Use an SVM that penalizes errors on one class more heavily; LibSVM supports this through per-class weights.
  • Boost the number of minority-class training examples by artificially creating new samples from the existing ones.
  • Resample the set so that both classes have a comparable number of samples (probably not an option in your case).
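
As a rough illustration of the first two points, here is a minimal Python sketch using only the standard library. The 1000-label dataset, the helper names (`accuracy`, `precision_recall`, `balanced_class_weights`) and the weighting heuristic are assumptions made for this example, not part of the original answer. The heuristic `n_samples / (n_classes * class_count)` is one common way to derive per-class costs; the resulting weights could then be passed to whichever cost-sensitive learner or weighted SVM you use.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of labels predicted correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def precision_recall(y_true, y_pred, positive=1):
    """Precision and recall for the `positive` (here: minority) class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    # Define the metric as 0.0 when its denominator is empty.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def balanced_class_weights(y):
    """Hypothetical helper: weight each class inversely to its frequency,
    using n_samples / (n_classes * class_count)."""
    counts = Counter(y)
    n_samples, n_classes = len(y), len(counts)
    return {label: n_samples / (n_classes * c) for label, c in counts.items()}


# Assumed toy labels: 98% majority class (0), 2% minority class (1).
y_true = [0] * 980 + [1] * 20

# A "classifier" that always predicts the majority class.
y_majority = [0] * len(y_true)

acc = accuracy(y_true, y_majority)
prec, rec = precision_recall(y_true, y_majority)
weights = balanced_class_weights(y_true)

print(f"accuracy={acc:.2f} precision={prec:.2f} recall={rec:.2f}")
print("class weights:", weights)
```

The always-majority predictor reaches 98% accuracy while its precision and recall on the minority class are both zero, which is why accuracy alone is misleading here. With these labels the minority class gets a weight of 25 and the majority class roughly 0.51, so a misclassified minority example costs about 49 times as much as a misclassified majority one.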