Machine Learning – Balancing Accuracy and Misclassification Costs When Over/Under-Sampling Unbalanced Classes

classification, machine-learning, unbalanced-classes

First of all, I would like to describe the common layout that data-mining books use when explaining how to deal with unbalanced datasets. Usually the main section is named Unbalanced Datasets and it covers two subsections: Cost-Sensitive Classification and Sampling Techniques.

It seems that, when facing a problem with a rare class, you can apply either cost-sensitive classification or sampling. Instead, I think that one should apply cost-sensitive techniques when the rare class is also the target of the classification and misclassifying a record of that class is costly.

On the other hand, sampling techniques, such as over-sampling and under-sampling, are useful when the target of the classification is good overall accuracy, without focusing on any particular class.
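To make the two sampling techniques concrete, here is a minimal sketch with NumPy, assuming a hypothetical toy dataset with 90 "common" rows and 10 "rare" rows (the helper names `oversample_minority` and `undersample_majority` are my own, not from any library):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy unbalanced dataset: 90 "common" rows (label 0), 10 "rare" rows (label 1).
X = rng.normal(size=(100, 2))
y = np.array([0] * 90 + [1] * 10)

def oversample_minority(X, y, minority=1):
    """Duplicate minority rows (sampling with replacement) until classes balance."""
    minority_idx = np.flatnonzero(y == minority)
    majority_idx = np.flatnonzero(y != minority)
    extra = rng.choice(minority_idx,
                       size=len(majority_idx) - len(minority_idx),
                       replace=True)
    keep = np.concatenate([majority_idx, minority_idx, extra])
    return X[keep], y[keep]

def undersample_majority(X, y, minority=1):
    """Randomly discard majority rows until classes balance."""
    minority_idx = np.flatnonzero(y == minority)
    majority_idx = rng.choice(np.flatnonzero(y != minority),
                              size=len(minority_idx),
                              replace=False)
    keep = np.concatenate([majority_idx, minority_idx])
    return X[keep], y[keep]

X_over, y_over = oversample_minority(X, y)
X_under, y_under = undersample_majority(X, y)
print(np.bincount(y_over))   # [90 90] — rare class duplicated up to 90 rows
print(np.bincount(y_under))  # [10 10] — common class cut down to 10 rows
```

Both end up balanced, but under-sampling throws information away, which is exactly the trade-off discussed below.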

This belief comes from the rationale of MetaCost, which is a general way to make a classifier cost-sensitive: if one wants to make a classifier cost-sensitive in order to penalise misclassification errors on the rare class, one should over-sample the other class. Roughly speaking, the classifier tries to adapt to the other class and it becomes specific to the rare class.
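The goal described here (penalising rare-class misclassification more heavily) can also be reached directly through class weights, without resampling at all. A minimal sketch using scikit-learn's `class_weight` parameter, on a hypothetical toy dataset (this illustrates cost-sensitivity in general, not the MetaCost wrapper itself):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
# 95 "common" points around the origin, 5 "rare" points shifted away.
X = np.vstack([rng.normal(0, 1, size=(95, 2)),
               rng.normal(2, 1, size=(5, 2))])
y = np.array([0] * 95 + [1] * 5)

# Plain classifier: every mistake costs the same.
plain = LogisticRegression().fit(X, y)

# Cost-sensitive classifier: a misclassified rare (class 1) example
# is weighted as 10x as costly as a misclassified common example.
costly = LogisticRegression(class_weight={0: 1, 1: 10}).fit(X, y)
```

The weighted model shifts its decision boundary toward the common class, so it labels more points as the rare class than the plain model does.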

This is the opposite of over-sampling the rare class, which is the usually suggested way to deal with this problem. Over-sampling the rare class or under-sampling the other class is useful to improve the overall accuracy.

It would be great if you could confirm my thoughts.

That said, the common question when facing an unbalanced dataset is:

Should I try to get a dataset that counts as many rare records as other ones?

My answer would be: if you are looking for accuracy, yes. You can do it either by finding more rare-class examples or by deleting some records of the other class.

In case you are focusing on the rare class, with a cost-sensitive technique, I would answer: you can find more rare-class examples, but you shouldn't delete records of the other class. In the latter case you will not be able to let the classifier adapt to the other class, and the rare-class misclassification error could increase.

What would you answer?

Best Answer

It's a good question. Personally, my answer would be that it never makes sense to throw data away (unless it is for computational reasons), as the more data you have, the better your model of the world can be. Therefore, I would suggest that modifying the cost function in an appropriate way for your task should be sufficient. For example, if you are interested in one particular rare class, you can make misclassifications of that class alone more expensive; if you are interested in a balanced measure, something like the Balanced Error Rate (the average of the errors on each class) or the Matthews Correlation Coefficient is appropriate; if you are interested only in overall classification error, the traditional 0-1 loss will do.
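The three measures mentioned can be compared on a small worked example. A sketch with scikit-learn, assuming made-up predictions on a toy unbalanced label vector (8 common, 2 rare):

```python
import numpy as np
from sklearn.metrics import confusion_matrix, matthews_corrcoef

y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
y_pred = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 0])

# Overall 0-1 error: weighs both classes equally by count, so it can look
# good even when the rare class is half misclassified.
error = np.mean(y_true != y_pred)

# Balanced Error Rate: the average of the per-class error rates.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
ber = 0.5 * (fp / (fp + tn) + fn / (fn + tp))

# Matthews Correlation Coefficient: a single balanced summary in [-1, 1].
mcc = matthews_corrcoef(y_true, y_pred)

print(error, ber, mcc)  # 0.2 0.3125 0.375
```

Note how the 0-1 error (0.2) looks mild while the BER (0.3125) exposes that half of the rare class was missed.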

A modern approach to the problem is to use Active Learning. For example, Hospedales et al. (2011), "Finding Rare Classes: Active Learning with Generative and Discriminative Models", IEEE Transactions on Knowledge and Data Engineering (TKDE 2011). However, I believe these approaches are still relatively immature.
