I am working on thousands of datasets. Many of them are "unbalanced": either a multi-class label with a highly skewed distribution (for example, three categories with 3500, 300, and 4 samples) or a continuous variable with a skewed distribution.
I am looking for some metric that can say "How badly unbalanced" the dataset is. Is there such a metric?
Eventually, I want to score these datasets according to their balance metric and provide a different balancing / machine learning solution for each of them.
I would prefer a Python solution if one exists.
Best Answer
You could use the Shannon entropy to measure balance.
On a data set of $n$ instances, if you have $k$ classes of size $c_i$ you can compute entropy as follows: $$ H = -\sum_{ i = 1}^k \frac{c_i}{n} \log{ \frac{c_i}{n}}. $$
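This is equal to $0$ when a single class contains every instance (so it tends to $0$ as the data set becomes more unbalanced), and to $\log{k}$ when all $k$ classes have the same size $\frac{n}{k}$.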
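Therefore, you could use the following measure of Balance for a data set: $$ \mbox{Balance} = \frac{H}{\log{k}} = \frac{-\sum_{ i = 1}^k \frac{c_i}{n} \log{ \frac{c_i}{n}}} {\log{k}} $$ which is equal to $0$ for a completely unbalanced data set (one class holds every instance) and to $1$ for a perfectly balanced one.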
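Since a Python solution was requested, here is a minimal sketch of this measure. It assumes the class labels are available as a plain sequence. The function name `balance` is hypothetical and not taken from any library. Note one assumption: $k$ is taken to be the number of classes that actually appear in the data, so a class with zero samples is not counted.

```python
from collections import Counter
from math import log


def balance(labels):
    """Normalized Shannon entropy of the class distribution.

    Returns 0.0 when at most one class is present and 1.0 when all
    observed classes have the same size.
    """
    counts = Counter(labels)   # class -> number of instances (c_i)
    n = sum(counts.values())   # total number of instances
    k = len(counts)            # number of observed classes (assumption)
    if k <= 1:
        # A single class (or no data): log(k) would be 0, so define balance as 0.
        return 0.0
    # H = -sum_i (c_i / n) * log(c_i / n)
    entropy = -sum((c / n) * log(c / n) for c in counts.values())
    return entropy / log(k)


# The 3500:300:4 example from the question
skewed = ["a"] * 3500 + ["b"] * 300 + ["c"] * 4
even = ["a"] * 100 + ["b"] * 100 + ["c"] * 100

print(balance(skewed))  # well below 1: strongly unbalanced
print(balance(even))    # approximately 1: balanced
```

Because the ratio $H / \log{k}$ does not depend on the base of the logarithm, the natural log used here gives the same score as any other base. The score can then be used to rank the datasets or to pick a threshold below which a rebalancing step is applied.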