Solved – an imbalanced data set

classificationdatasetunbalanced-classes

I've seen this term several times already, and have no idea what it means, can you explain what it means ?

Best Answer

Answer: A dataset is imbalanced if the distribution of classes is not uniform.

in other (paper reference) words: Imbalance problem occur where one of the two classes having more sample than other classes.

Problem: The problem arising from imbalance is that accuracy can sometimes be high for a classifier although it does perform rather bad.

Example: A simple example could be found in disease diagnostics. Assume you have a disease which only a very few people actually have. Lets assume you have a dataset of 100 people. 90 people are healthy, 10 are sick. Thus, if you construct a classifier which always predicts "healthy", you will achieve a wonderful accuracy of 90%. Yet in this specific case it is a rather bad performance because all sick people are falsely diagnosed as healthy.

What to do: Here sensitivity and specificity come in handy which are implicitly used when computing the AUC (area under the curve) for evaluation of a classifier. Wikipedia provides great further information.

Related Question