Solved – Computing the Gini index

data mining, gini


How do I compute the Gini index using the Instance attribute as the attribute test condition?

I calculated the Gini, but I have no clue how to do it for this Instance attribute.

$$\text{Gini for } a_1 = 0.345 $$
$$\text{Gini for } a_2 = 0.493 $$
$$\text{Gini for } a_3 = ?$$

My guess is that the Instance attribute has no information gain, but I can't prove it.

$$\text{Gini} = 1 - \sum_i p(i|t)^2$$

Best Answer

The Gini index here ($G$, say) just measures diversity or heterogeneity (or uncertainty, if you will) as one minus the sum of squared category probabilities. If every value is in the same category, then the measure is $1 - 1^2 = 0$. If each of $n$ values is in a distinct category, then the measure is $1 - n(1/n)^2 = 1 - 1/n$. The complement $1 - G$ is in some ways easier to think about: its reciprocal $1 / (1 - G)$ returns the "numbers equivalent", i.e. the equivalent number of equally common classes. The extremes for that are clearly $1/1$ and $1/(1/n)$, i.e. $1$ and $n$.
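As a rough illustration of those two extremes, here is a minimal Python sketch. The function names `gini` and `numbers_equivalent` are made up for this example and do not come from any library, and the sample lists are arbitrary.

```python
from collections import Counter

def gini(values):
    """Gini index: 1 minus the sum of squared category proportions."""
    n = len(values)  # assumes a non-empty list
    counts = Counter(values)
    return 1 - sum((c / n) ** 2 for c in counts.values())

def numbers_equivalent(values):
    """Reciprocal of the complement, 1 / (1 - G)."""
    return 1 / (1 - gini(values))

same = ["T"] * 9          # every value in the same category
distinct = list(range(9)) # every value in its own category (n = 9)

print(gini(same), numbers_equivalent(same))          # G = 0, equivalent to 1 class
print(gini(distinct), numbers_equivalent(distinct))  # G close to 1 - 1/9, about 9 classes
```

The second line of output is subject to floating-point rounding, so expect values very close to, rather than exactly, $8/9$ and $9$.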

Your columns $a_1$ and $a_2$ have 4 T and 5 F, and 5 T and 4 F, respectively, which I get to be the same index, namely $1 - (4/9)^2 - (5/9)^2 = 0.4938271605$. That's a ridiculous number of decimal places, but it suggests that you have a gross error for one column and a rounding error for the other. With your $a_3$ the principle does not change, as the index ignores the labels on the categories: whatever metric meaning they might have is not considered. By my calculation you have $1 - 5(1/9)^2 - 2(2/9)^2 = 0.8395061728$.
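The arithmetic can be checked with exact fractions. This is only a sketch: `gini_from_counts` is a hypothetical helper, and the category counts are read off the formulas above (a 4/5 split for $a_1$ and $a_2$; five values occurring once and two values occurring twice for $a_3$), not from the original data table.

```python
from fractions import Fraction

def gini_from_counts(counts):
    """1 minus the sum of squared proportions, computed exactly."""
    total = sum(counts)
    return 1 - sum(Fraction(c, total) ** 2 for c in counts)

# Counts assumed from the answer's arithmetic, not from the raw data.
g_binary = gini_from_counts([4, 5])               # a_1 and a_2
g_a3 = gini_from_counts([1, 1, 1, 1, 1, 2, 2])    # a_3

print(g_binary, float(g_binary))  # 40/81
print(g_a3, float(g_a3))          # 68/81
```

Working in fractions shows that the long decimals are just $40/81$ and $68/81$.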

Other names for this measure $G$ (or its complement, or the reciprocal of that) are Simpson, Herfindahl and repeat rate. Gini appears to have got there first, but its applications across ecology, economics, linguistics and many other fields are legion.
