Solved – the difference between Gini index and Gini coefficient

I am building a decision tree from scratch. I have been using entropy so far (calculated this way):

def calculate_entropy(data):

    label_column = data[:,-1]
    _,counts = np.unique(label_column,return_counts=True)

    probabilities = counts/counts.sum()
    entropy = sum(probabilities*(-np.log2(probabilities)))

    return entropy

Now I want to implement a function that calculates the Gini index of an array. From here I get the following function that calculates:

def gini_coeff(x):
    # requires all values in x to be zero or positive numbers,
    # otherwise results are undefined
    n = len(x)
    s = x.sum()
    r = argsort(argsort(-x)) # calculates zero-based ranks
    return 1 - (2.0 * (r*x).sum() + s)/(n*s)

That for

x = [1,1,2]
gini_coeff(np.array(x))

0.16666666666666663

But from this other source I see that the Gini impurity can be calculated

$ G = \sum_i^n \cdot p(i) \cdot (1-p(i))$

That I implemented this way

def calculate_gini(data):
    label_column = data
    _,counts = np.unique(label_column,return_counts=True)

    probabilities = counts/counts.sum()
    return sum(probabilities*(1-probabilities))

calculate_gini(x)

0.4444444444444444

I believe that I am making here a conceptual problem. What is it?
What are the differences? Which one should I use for a decision tree?

Best Answer

The Gini coefficient measures dispersion of non-negative values in such a fashion that Gini coefficient = 0 describes perfect equality (zero variation of values), and Gini coefficient = 1 describes 'maximal inequality' where all individuals (units, etc.) have value zero, and all non-zero value is concentrated in a single individual. This was developed in an economic context to describe income distribution inequality.

By contrast the Gini impurity measures "how often a randomly chosen element from the set would be incorrectly labeled if it was randomly labeled according to the distribution of labels in the subset." If your work is CART based (as your tag suggests), you are interested in the Gini impurity, not the Gini coefficient.

The Gini impurity for a single label $i$ is the probability of that item occurring time the probability of some other label $\ne i$ occurring, which gives $p_{i}(1-p_{i})$. However, the Gini impurity for a set of labels ($J$ many labels) is the sum of this $p_{i}(1-p_{i})$ term across all labels, which simplifies to $1-\sum_{i=1}^{J}{p_{i}^{2}}$

Your code seems, to my admittedly weak coding eye, to capture the definition for a set of labels.

Best Answer

Related Solutions

Solved – What methods exist for finding optimal splits to discretize continuous data with respect to a target variable

Cart Likelihood-Ratio Information-Theory – Understanding the Relationship Between GINI Score and Log-Likelihood Ratio

Related Question