Solved – a general measure of data-set imbalance

machine learning, skewness, unbalanced-classes

I am working with thousands of datasets. Many of them are "unbalanced": either a multi-class target with a highly skewed distribution (for example, three categories with a ratio of 3500:300:4 samples) or a continuous variable with a skewed distribution.
I am looking for a metric that quantifies how badly unbalanced a dataset is. Is there such a metric?

Eventually, I want to score these datasets according to their balance metric and apply a different balancing/machine learning solution to each of them.
I would prefer a Python solution if one exists.

Best Answer

You could use the Shannon entropy to measure balance.

On a data set of $n$ instances, if you have $k$ classes of size $c_i$ you can compute entropy as follows: $$ H = -\sum_{ i = 1}^k \frac{c_i}{n} \log{ \frac{c_i}{n}}. $$

This is equal to:

  • $0$ when there is a single class; more generally, $H$ tends to $0$ as the data set becomes very unbalanced
  • $\log{k}$ when all classes are balanced, i.e. each has the same size $\frac{n}{k}$

Therefore, for $k > 1$, you could use the following measure of balance for a data set: $$ \mbox{Balance} = \frac{H}{\log{k}} = \frac{-\sum_{ i = 1}^k \frac{c_i}{n} \log{ \frac{c_i}{n}}} {\log{k}} $$ which is equal to:

  • close to $0$ for a highly unbalanced data set
  • $1$ for a perfectly balanced data set
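
Since the question asks for Python, here is a minimal sketch of the measure above using only the standard library. The function name `balance` is a hypothetical choice, and returning `0.0` when fewer than two classes are present is an assumption made here, because the formula divides by $\log{k}$ and is undefined for $k = 1$.

```python
import math
from collections import Counter


def balance(labels):
    """Normalized Shannon entropy H / log(k) of the class labels.

    `balance` is a hypothetical helper name, not a library function.
    Only classes that actually occur in `labels` are counted, so every
    c_i is positive and log(c_i / n) is always defined.
    """
    counts = Counter(labels)
    n = sum(counts.values())
    k = len(counts)
    if k <= 1:
        # Assumption: treat an empty or single-class data set as
        # maximally unbalanced, since H / log(k) is undefined here.
        return 0.0
    h = -sum((c / n) * math.log(c / n) for c in counts.values())
    return h / math.log(k)


# The 3500:300:4 example from the question.
labels = ["a"] * 3500 + ["b"] * 300 + ["c"] * 4
print(balance(labels))  # roughly 0.26
```

Because only observed classes are counted, a class that should exist but has zero samples does not lower the score. If the full set of expected classes is known, $k$ could be fixed to that number instead.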