Solved – a general measure of data-set imbalance

machine learning, skewness, unbalanced-classes

I am working with thousands of datasets. Many of them are "unbalanced": either a multi-class label with a highly skewed distribution (for example, three categories with a ratio of 3500:300:4 samples) or a continuous variable with a skewed distribution.
I am looking for a metric that says how badly unbalanced a dataset is. Is there such a metric?

Eventually, I want to score these datasets according to their balance metric and provide a different balancing/machine learning solution for each of them.
I would prefer a Python solution if one exists.

Best Answer

You could use the Shannon entropy to measure balance.

On a data set of $n$ instances, if you have $k$ classes of size $c_i$ you can compute entropy as follows: $$ H = -\sum_{ i = 1}^k \frac{c_i}{n} \log{ \frac{c_i}{n}}. $$

This is equal to:

$0$ when there is only a single class. In other words, it tends to $0$ when your data set is very unbalanced
$\log{k}$ when all your classes are balanced, each of the same size $\frac{n}{k}$

Therefore, you could use the following measure of Balance for a data set: $$ \mbox{Balance} = \frac{H}{\log{k}} = \frac{-\sum_{ i = 1}^k \frac{c_i}{n} \log{ \frac{c_i}{n}}} {\log{k}} $$ which is equal to:

$0$ (in the limit) for a completely unbalanced data set
$1$ for a perfectly balanced data set
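
Since the question asks for Python, here is a minimal sketch of the Balance measure above using only the standard library. The function names `balance_from_counts` and `balance` are my own, not from any package. Two conventions are assumptions on my part: a class with zero samples contributes nothing to the sum (the usual $0 \log 0 = 0$ convention) but still counts towards $k$, and a data set with fewer than two classes gets a Balance of $0$ because $\log{k}$ would be $0$.

```python
import math
from collections import Counter


def balance_from_counts(counts):
    """Normalized Shannon entropy H / log(k) for a list of class sizes c_i."""
    k = len(counts)
    n = sum(counts)
    if k < 2 or n == 0:
        # Assumed convention: log(k) is 0 for a single class, so return 0.
        return 0.0
    # Skip empty classes: 0 * log(0) is treated as 0 (assumed convention).
    h = -sum((c / n) * math.log(c / n) for c in counts if c > 0)
    return h / math.log(k)


def balance(labels):
    """Same measure, computed from a raw sequence of class labels."""
    return balance_from_counts(list(Counter(labels).values()))


if __name__ == "__main__":
    # The 3500:300:4 example from the question, and an evenly split data set.
    print(balance_from_counts([3500, 300, 4]))
    print(balance_from_counts([100, 100, 100]))
```

For the 3500:300:4 example this gives a value of roughly a quarter, well below the $1$ of an evenly split data set. For a continuous target you would first have to bin the values into classes, and the result then depends on the binning you choose.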
