Solved – How to derive equation of Gini index used in Decision Trees


The Gini coefficient is formally defined as the area between the line of equality and the Lorenz curve. Using this definition I can derive the equation

[image: derivation of the Gini coefficient from the Lorenz curve]

However, I can't obtain the exact Gini index equation used in Decision trees.

$$\text{Gini} = 1 - \sum_j p_j^2$$

I would be grateful for any suggestions or resources explaining how this equation is derived.

Best Answer

<rant>This (like so much in Machine Learning) is an example of an existing word used in a completely different sense.

If $p_j$ is the probability of an item being in category $j$ with $\sum p_j = 1$, then $1-\sum p_j^2$ is in fact the complement of the Herfindahl index, and so is a measure of diversity rather than equality. If there are $n$ categories, then $1-\sum p_j^2$ ranges from almost $0$ (when the probability is near $1$ that almost everything is in a single category) up to $1-\frac1n$ (when each of the $n$ categories is equally likely). By contrast, the corresponding Gini coefficients would be $1-\frac{1}{2n}$ and $0$, so almost the other way round.

Alternatively, consider what happens if you increase $n$ by introducing lots of new $p_k$s close to $0$ to an existing calculation: $1-\sum p_j^2$ will barely change as you have not introduced much new diversity, but the Gini coefficient would fall dramatically as the inequality increased.</rant>
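To make the contrast concrete, here is a small numeric check (my own illustration, not from the original answer) of the claims above: $1-\sum p_j^2$ is maximal at $1-\frac1n$ for a uniform distribution, near $0$ for a concentrated one, and barely moves when many near-zero categories are added.

```python
def diversity(p):
    """Complement of the Herfindahl index: 1 - sum of squared probabilities."""
    return 1 - sum(q * q for q in p)

n = 4
uniform = [1 / n] * n                     # every category equally likely
concentrated = [0.97, 0.01, 0.01, 0.01]   # almost everything in one category

print(diversity(uniform))       # 1 - 1/n = 0.75, the maximum for n = 4
print(diversity(concentrated))  # close to 0

# Introduce 100 new categories with tiny probabilities, then renormalise
# so the probabilities still sum to 1:
extended = concentrated + [1e-6] * 100
total = sum(extended)
extended = [q / total for q in extended]
print(diversity(extended))      # barely changed: little new diversity added
```

A Gini coefficient computed on the same vectors would behave very differently, falling sharply as the many near-zero entries increase inequality.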


What you have is the so-called Gini impurity, not to be confused with the Gini coefficient.

This impurity is the overall probability of misclassifying an item if you were to lazily classify it based on the probabilities of the overall distribution. You can derive it by saying:

  • The probability you correctly lazily classify an item in category $j$ is $p_j$
  • The probability you lazily misclassify an item actually in category $j$ is $1-p_j$
  • The overall probability you lazily misclassify an item is then $$\sum p_j\left(1-p_j\right) = \sum \left(p_j-p_j^2\right) = \left(\sum p_j\right) - \left(\sum p_j^2\right)= 1 - \sum p_j^2 $$
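The algebra above can be checked directly: both forms of the Gini impurity agree for any probability vector. A minimal sketch (function names are my own):

```python
def gini_impurity_misclass(p):
    """Gini impurity as the expected probability of lazily misclassifying:
    sum over categories of p_j * (1 - p_j)."""
    return sum(q * (1 - q) for q in p)

def gini_impurity_squares(p):
    """Gini impurity in the form usually quoted for decision trees:
    1 - sum of squared probabilities."""
    return 1 - sum(q * q for q in p)

p = [0.5, 0.3, 0.2]
print(gini_impurity_misclass(p))  # 0.62
print(gini_impurity_squares(p))   # 0.62, identical by the derivation above
```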