Solved – Entropy Impurity, Gini Impurity, Information gain – differences

Tags: cart, machine-learning, scikit-learn

I'm trying to understand the theory behind decision trees (CART) and especially the scikit-learn implementation.

CART constructs binary trees using the feature and threshold that
yield the largest information gain at each node.
(scikit-learn documentation)

Furthermore, it defines the Gini impurity and the entropy impurity as follows:

Gini: $H = \sum_{k} p_k (1 - p_k) = 1 - \sum_{k} p_k^2$

Entropy: $H = -\sum_{k} p_k \log(p_k)$

where $p_k$ is the proportion of samples of class $k$ at the node.
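
To check my reading of those formulas, here is a small NumPy sketch of both measures (the function names are my own; I use log base 2 for the entropy, which only rescales it by a constant factor):

```python
import numpy as np

def gini_impurity(labels):
    """Gini impurity: 1 - sum_k p_k^2."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def entropy_impurity(labels):
    """Entropy impurity: -sum_k p_k * log2(p_k)."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

labels = np.array([0, 0, 0, 1, 1, 2])
print(gini_impurity(labels))     # ~0.611
print(entropy_impurity(labels))  # ~1.459
```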

And it says that I should

select the parameters that minimise the impurity.

However, for DecisionTreeClassifier specifically, I can choose the criterion:

Supported criteria are “gini” for the Gini impurity and “entropy” for
the information gain.
(DecisionTreeClassifier documentation)

What I don't understand is that, as I see it, information gain is the difference between the impurity of the parent node and the weighted average of the impurities of the left and right children. This means I should choose the feature with the highest gain for the split, as in the sketch below.
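
In code, I picture it like this (my own toy helper, not scikit-learn's API):

```python
import numpy as np

def gini(y):
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def information_gain(parent, left, right, impurity=gini):
    # Gain = parent impurity minus the weighted average of the child impurities.
    w_left = len(left) / len(parent)
    w_right = len(right) / len(parent)
    return impurity(parent) - w_left * impurity(left) - w_right * impurity(right)

parent = np.array([0, 0, 0, 1, 1, 1])
left, right = parent[:3], parent[3:]           # a perfectly separating split
print(information_gain(parent, left, right))   # 0.5: impurity drops from 0.5 to 0
```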

Is information gain only applicable to entropy, or also to Gini impurity? According to the classifier's criterion it is either Gini impurity or entropy/information gain, which would mean that it either minimizes Gini impurity or maximizes information gain?

Best Answer

The algorithm minimizes an impurity metric; you select which metric to minimize, either cross-entropy or Gini impurity. Minimizing the children's cross-entropy is equivalent to maximizing information gain, because the parent's impurity is already fixed when the split is evaluated.
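
In scikit-learn this choice is just the criterion parameter; the split search itself is identical in both cases. A minimal usage example:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Same algorithm, different impurity metric to minimize at each split:
tree_gini = DecisionTreeClassifier(criterion="gini", random_state=0).fit(X, y)
tree_entropy = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X, y)

print(tree_gini.get_depth(), tree_entropy.get_depth())
```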

Here you can see the criteria name mapping:

CRITERIA_CLF = {"gini": _criterion.Gini, "entropy": _criterion.Entropy}

And here is their implementation: the code for calculating the entropy impurity and the Gini impurity.

In the docstring of the impurity_improvement method, it is stated that:

This method computes the improvement in impurity when a split occurs. The weighted impurity improvement equation is the following: $$\frac{N_t}{N} \left(\text{impurity} - \frac{N_{tR}}{N_t}\,\text{right\_impurity} - \frac{N_{tL}}{N_t}\,\text{left\_impurity}\right)$$ where $N$ is the total number of samples, $N_t$ is the number of samples at the current node, $N_{tL}$ is the number of samples in the left child, and $N_{tR}$ is the number of samples in the right child.
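
Transcribed directly into Python (a plain sketch of the docstring's equation, not the actual Cython implementation):

```python
def impurity_improvement(n, n_t, n_t_left, n_t_right,
                         impurity, left_impurity, right_impurity):
    """Weighted impurity improvement:
    N_t / N * (impurity - N_tR / N_t * right_impurity
                        - N_tL / N_t * left_impurity)
    """
    return (n_t / n) * (impurity
                        - (n_t_right / n_t) * right_impurity
                        - (n_t_left / n_t) * left_impurity)

# A node holding half the samples, whose split removes all impurity:
print(impurity_improvement(n=100, n_t=50, n_t_left=25, n_t_right=25,
                           impurity=0.5, left_impurity=0.0, right_impurity=0.0))
# 0.25
```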

The algorithm then selects the feature and threshold with the highest improvement, i.e. the largest reduction in impurity, which is exactly the information gain.
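
Putting it together, a toy exhaustive search over thresholds on a single feature could look like this (illustrative only; scikit-learn's splitter is a heavily optimized Cython version of the same idea):

```python
import numpy as np

def gini(y):
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(x, y):
    """Try every candidate threshold on one feature and keep the split
    with the largest impurity decrease (information gain)."""
    best_gain, best_threshold = 0.0, None
    for threshold in np.unique(x)[:-1]:        # cut points between values
        mask = x <= threshold
        left, right = y[mask], y[~mask]
        gain = (gini(y)
                - (len(left) / len(y)) * gini(left)
                - (len(right) / len(y)) * gini(right))
        if gain > best_gain:
            best_gain, best_threshold = gain, threshold
    return best_threshold, best_gain

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([0, 0, 0, 1, 1, 1])
print(best_split(x, y))  # (3.0, 0.5): splitting at x <= 3.0 removes all impurity
```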