Solved – Is it possible to use SD instead of entropy

cart, entropy, information theory, standard deviation, unsupervised learning

While discussing decision trees in class, my teacher touched upon the topic of entropy. I have understood the purpose of entropy (though not where the formula $H(X)= -\sum_{i}{p(x_i) \log p(x_i)}$ comes from).

But anyway, I was wondering if there is a simpler way to say that having (3 blue m&ms, 3 red m&ms, 3 orange m&ms, 3 yellow m&ms) is higher entropy than having (1 blue m&m, 2 red m&ms, 3 orange m&ms, 6 yellow m&ms).

Why can't we just compute the standard deviation? The higher the standard deviation, the lower the entropy.

If I were to do it here,

  • CASE1 : 1 blue m&m, 2 red m&ms, 3 orange m&ms, 6 yellow m&ms

    • $\bar{x} = (1+2+3+6)/4 = 3$
    • $\sum_i (x_i - \bar{x})^2 = (1-3)^2 + (2-3)^2 + (3-3)^2 + (6-3)^2 = 14$
  • CASE2 : 3 blue m&ms, 3 red m&ms, 3 orange m&ms, 3 yellow m&ms

    • $\bar{x} = (3+3+3+3)/4 = 3$
    • $\sum_i (x_i - \bar{x})^2 = (3-3)^2 + (3-3)^2 + (3-3)^2 + (3-3)^2 = 0$

Once again: the lower the SD, the higher the entropy, which holds true here.
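As a quick numerical check of both quantities, here is a minimal sketch in Python (the `entropy_bits` helper is just the formula above applied to the normalized counts, and the SD is computed on the raw counts, as in the calculation):

```python
import numpy as np

def entropy_bits(counts):
    """Shannon entropy (in bits) of the color distribution given raw counts."""
    p = np.asarray(counts, dtype=float)
    p = p / p.sum()
    p = p[p > 0]                      # treat 0 * log(0) as 0
    return -np.sum(p * np.log2(p))

case1 = [1, 2, 3, 6]   # 1 blue, 2 red, 3 orange, 6 yellow
case2 = [3, 3, 3, 3]   # 3 of each color

for name, counts in [("CASE1", case1), ("CASE2", case2)]:
    print(name,
          "entropy = %.3f bits" % entropy_bits(counts),
          "SD of counts = %.3f" % np.std(counts))
# CASE1: entropy ~ 1.730 bits, SD ~ 1.871
# CASE2: entropy = 2.000 bits, SD = 0.000
```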

Best Answer

Why can't we just compute the standard deviation?

Here's why. Let's compare the formulas for entropy and variance:

  • $H(X) = - \sum\limits_x p(x) \, \log p(x) = - \mathbb E \, [ \log p(X) ]$
  • $\text{var} (X) = \mathbb E \, \Big[(X - \mathbb E[X])^2 \Big]$

So note that entropy does not care about the values that $X$ may take; it cares only about the distribution itself, while variance does depend on the values of $X$. Also, variance requires the variable to be numeric, which is not the case for entropy. Both of these properties make entropy a good candidate for computing the information gain.
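To make the difference concrete, here is a small sketch (the specific outcome values 0/1 and 0/100 are made up for illustration): two variables with the same 50/50 distribution have identical entropy, but very different variances, because variance depends on the numeric outcome values while entropy only sees the probabilities.

```python
import numpy as np

def entropy_bits(p):
    """Shannon entropy (in bits) of a probability vector."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

p = np.array([0.5, 0.5])        # same distribution for both variables

values_x = np.array([0, 1])     # X takes values 0 or 1
values_y = np.array([0, 100])   # Y takes values 0 or 100

var_x = np.sum(p * (values_x - np.sum(p * values_x)) ** 2)
var_y = np.sum(p * (values_y - np.sum(p * values_y)) ** 2)

print(entropy_bits(p))          # 1.0 bit -- the same for X and Y
print(var_x, var_y)             # 0.25 vs 2500.0 -- variance sees the values
```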

To get more insights into entropy and other information-theoretic measures, you may read this question on math.SE.
