While discussing decision trees in class, my teacher touched upon the topic of entropy. I understand the purpose of entropy (though I have not understood where the formula $H(X)= -\sum_{i}{p(x_i) \log p(x_i)}$ comes from).
But anyway, I was wondering if there is a simpler way to show that having (3 blue m&ms, 3 red m&ms, 3 orange m&ms, 3 yellow m&ms) is higher entropy than having (1 blue m&m, 2 red m&ms, 3 orange m&ms, 6 yellow m&ms).
Why can't we just compute the standard deviation? The higher the standard deviation, the lower the entropy.
If I were to do it here,
CASE1 : 1 blue m&m, 2 red m&ms, 3 orange m&ms, 6 yellow m&ms
- $\bar{x} = (1+2+3+6)/4 = 3$
- $s_x = \sqrt{\dfrac{(1-3)^2 + (2-3)^2 + (3-3)^2 + (6-3)^2}{4}} = \sqrt{\dfrac{14}{4}} \approx 1.87$
CASE2 : 3 blue m&ms, 3 red m&ms, 3 orange m&ms, 3 yellow m&ms
- $\bar{x} = (3+3+3+3)/4 = 3$
- $s_x = \sqrt{\dfrac{(3-3)^2 + (3-3)^2 + (3-3)^2 + (3-3)^2}{4}} = 0$
Once again, the lower the SD, the higher the entropy, which holds true here.
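To make the comparison concrete, here is a quick Python sketch of what I mean (my own illustration: I take the entropy of the colour proportions and the population standard deviation of the counts; the helper `entropy` is just for this example):

```python
import math
import statistics

def entropy(counts):
    """Shannon entropy (in bits) of the colour distribution implied by the counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

case1 = [1, 2, 3, 6]   # 1 blue, 2 red, 3 orange, 6 yellow
case2 = [3, 3, 3, 3]   # 3 of each colour

for name, counts in [("CASE1", case1), ("CASE2", case2)]:
    print(name,
          "entropy =", round(entropy(counts), 3), "bits,",
          "stdev of counts =", round(statistics.pstdev(counts), 3))

# CASE1 entropy = 1.73 bits, stdev of counts = 1.871
# CASE2 entropy = 2.0 bits, stdev of counts = 0.0
```

So at least for these two bags, the ordering by SD is the reverse of the ordering by entropy.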
Best Answer
Here's why the standard deviation can't simply stand in for entropy. Let's compare the formulas for entropy and variance:
$$H(X) = -\sum_{i} p(x_i) \log p(x_i), \qquad \operatorname{Var}(X) = \sum_{i} p(x_i)\,(x_i - \mu)^2, \quad \text{where } \mu = \sum_{i} p(x_i)\,x_i.$$
Note that entropy does not care about the values that $X$ may take; it cares only about the distribution itself, whereas variance does depend on the values of $X$. Also, for variance the variable has to be numeric, which is not the case for entropy. Both of these properties make entropy a good candidate for calculating the information gain.
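As a minimal illustration of that difference (my own sketch, assuming a uniform distribution over four categories and two arbitrary choices of numeric labels):

```python
import math

def entropy(probs):
    """Shannon entropy (bits): depends only on the probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def variance(values, probs):
    """Variance of a discrete numeric variable: depends on the values as well."""
    mu = sum(v * p for v, p in zip(values, probs))
    return sum(p * (v - mu) ** 2 for v, p in zip(values, probs))

probs = [0.25, 0.25, 0.25, 0.25]

# The same distribution placed on two different sets of numeric labels:
print(entropy(probs))                      # 2.0 bits, regardless of the labels
print(variance([1, 2, 3, 4], probs))       # 1.25
print(variance([10, 20, 30, 40], probs))   # 125.0 -- changes with the labels

# For categorical outcomes such as m&m colours there are no numeric values,
# so variance is not even defined, yet entropy(probs) still is.
```

Relabelling the outcomes changes the variance but leaves the entropy untouched, which is exactly the point made above.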
To get more insight into entropy and other information-theoretic measures, you may read this question on math.SE.