Solved – Gini Index Formula

cart, gini

I've read many related articles and posts. The more I read, the more confused I got about 'Gini index' and 'Gini impurity'. I understand the concept, but it seems to me that different people use these terms differently.
The ISLR book* (page 326) defines the Gini Index as $\sum p_i(1 - p_i)$ or $1 - \sum p_i^2$.

However, this article (and many others) computes Gini with the formula $p^2+q^2$ for a binary classifier. [The same question was also asked in the comments there by Shanu, but it was not answered.]

So their Gini Impurity, $1 - \text{Gini Index}$, is exactly the same as the Gini Index computed as per the ISLR book.

Please let me know what I am missing. I realize that reading concepts after a long break is painful.

*Gareth James, Daniela Witten, Trevor Hastie, Robert Tibshirani (2013). An Introduction to Statistical Learning: with Applications in R. New York: Springer.

Best Answer

Usually, the terms Gini Index and Gini Impurity are used as synonyms. Indeed, when defined as $1-\sum p_i^2$, it measures impurity, in the sense that it is 0 for a pure node and grows as the classes become more mixed.
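As a quick numerical check, here is a minimal Python sketch of the two equivalent forms from the question. The function names and the toy label lists are made up for illustration and do not come from any library.

```python
from collections import Counter


def class_proportions(labels):
    """Return the fraction of samples in each class (hypothetical helper)."""
    n = len(labels)
    if n == 0:
        raise ValueError("labels must be non-empty")
    return [count / n for count in Counter(labels).values()]


def gini_impurity(labels):
    """Gini index / impurity in the form 1 - sum(p_i^2)."""
    return 1.0 - sum(p * p for p in class_proportions(labels))


def gini_impurity_alt(labels):
    """Same quantity in the form sum(p_i * (1 - p_i))."""
    return sum(p * (1.0 - p) for p in class_proportions(labels))


# Toy nodes, from perfectly pure to evenly mixed (illustrative data only).
pure = ["a"] * 10
skewed = ["a"] * 9 + ["b"]
balanced = ["a"] * 5 + ["b"] * 5

for name, node in [("pure", pure), ("skewed", skewed), ("balanced", balanced)]:
    print(name, round(gini_impurity(node), 4), round(gini_impurity_alt(node), 4))
```

Both functions should agree up to floating-point error, and the value should go from 0 for the pure node up to 0.5 for the balanced two-class node.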

To me it looks like the link you gave uses an alternative, rather confusing definition, in which Gini Index is a measure of purity and Gini Impurity is 1 - Index. I had never seen this in the literature, and after a quick tour of links and definitions I could not find it anywhere else.
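For the binary case, a short worked check shows why that convention lands on the same number as the ISLR definition. Here $p$ and $q$ are the two class proportions, and the only assumption is that $p + q = 1$:

$$1 - (p^2 + q^2) = (p + q)^2 - p^2 - q^2 = 2pq = p(1 - p) + q(1 - q)$$

So "1 - Index", with Index taken as $p^2 + q^2$, is just $\sum p_i(1 - p_i)$ written for two classes. For example, $p = q = 0.5$ gives $0.5$ either way.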

Therefore, I would rather use the definition you can find in the Hastie/Tibshirani book, as it is the most common. Indeed, we can trace that definition back to Classification And Regression Trees (Breiman, 1984):

4.3.3 The Gini Criterion
[...]
In later work the Gini diversity index was adopted. This has the form: $$i(t) = \sum_{i\neq j}p(i|t)p(j|t)$$ and can also be written as $$i(t) = [...] =1-\sum_jp(j|t)^2$$

The original name is therefore Gini (diversity) index, but since it is a measure of impurity you may also call it Gini impurity.
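The step between the two forms quoted above can be sketched as follows. This is a reconstruction under the single assumption that the class proportions at node $t$ sum to one, not the book's own elided text:

$$\sum_{i\neq j}p(i|t)p(j|t) = \Big(\sum_j p(j|t)\Big)^2 - \sum_j p(j|t)^2 = 1 - \sum_j p(j|t)^2$$

Squaring the sum produces every pairwise product, and subtracting the $i = j$ terms leaves only the products between different classes.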
