Random Forest – Understanding Gini Decrease and Gini Impurity of Children Nodes

cart, feature-selection, random-forest

I'm working on the Gini feature importance measure for random forests. For this, I need to calculate the Gini decrease in node impurity. Here is the way I do it, which leads to a conflict with the definition, suggesting that I must be wrong somewhere… 🙂

For a binary tree, given the proportions $p_l$ and $p_r$ of observations in the left and right children, I calculate the Gini impurity of a node $n$ as:

$$ i(n) = 1 - p_l^2 - p_r^2 $$

And the Gini decrease:

$$ \Delta i(n) = i(n) - p_l\, i(n_l) - p_r\, i(n_r) $$


So, for this example with 110 observations on a node:

- node (110)
   - left (100)
      - left_left (60)
      - left_right (40)
   - right (10)
      - right_left (5)
      - right_right (5)

I would calculate the Gini decrease for the node like this:

\begin{align}
i({\rm left}) &= 1 - (60/100)^2 - (40/100)^2 &&= 0.48 \\
i({\rm right}) &= 1 - (5/10)^2 - (5/10)^2 &&= 0.50 \\
i({\rm node}) &= 1 - (100/110)^2 - (10/110)^2 &&\approx 0.165
\end{align}

But following Breiman's definition (or this answer on CV: How to measure / rank "variable importance" when using CART; I don't have access to the referenced book), the impurity criterion of the descendent nodes should be less than that of the parent node:

> Gini importance
>
> Every time a split of a node is made on variable m the gini impurity criterion for the two descendent nodes is less than the parent node. Adding up the gini decreases for each individual variable over all trees in the forest gives a fast variable importance that is often very consistent with the permutation importance measure.

Because otherwise, it leads to a negative Gini decrease:

$$\Delta i({\rm node}) = i({\rm node}) - (100/110)\, i({\rm left}) - (10/110)\, i({\rm right}) \approx -0.32$$
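For reference, here is a short Python snippet (just for illustration, not part of any library) that reproduces the arithmetic above; note that it only uses the sample sizes in each node, exactly as in my formulas:

```python
# Reproduces the arithmetic above; it only uses the sample sizes
# of the nodes, exactly as in the formulas in the question.
def gini_from_counts(counts):
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

i_left = gini_from_counts([60, 40])    # 0.48
i_right = gini_from_counts([5, 5])     # 0.50
i_node = gini_from_counts([100, 10])   # ~0.165

delta = i_node - (100 / 110) * i_left - (10 / 110) * i_right
print(delta)                           # ~ -0.32, a negative "decrease"
```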

So, if someone could tell me where I'm wrong, I would be very grateful, because it looks like I'm missing something obvious here…

Best Answer

You simply did not use the target class variable at all. Gini impurity, like all other impurity functions, measures the impurity of the class outputs after a split. What you have done is to measure something using only the sample sizes.

Let me derive the formula for your case.

Suppose for simplicity that you have a binary classification problem. Denote by $A$ the test attribute and by $C$ the class attribute, which takes the values $c_+, c_-$.

The initial Gini index before the split is given by $$I(A) = 1 - P(A_+)^2 - P(A_-)^2$$ where $P(A_+)$ is the proportion of data points in $A$ which have the value $c_+$ for the class variable, and similarly for $P(A_-)$.

Now, the impurities for the left and right nodes would be $$I(A_l) = 1 - P(A_{l+})^2 - P(A_{l-})^2$$ $$I(A_r) = 1 - P(A_{r+})^2 - P(A_{r-})^2$$ where $P(A_{l+})$ is the proportion of data points from the left subset of $A$ which have the value $c_+$ for the class variable, and so on.

Now, the final formula for the Gini gain would be

$$GiniGain(A) = I(A) - p_{\rm left}\, I(A_l) - p_{\rm right}\, I(A_r)$$ where $p_{\rm left}$ is the proportion of instances in the left subset, i.e. $\frac{|A_l|}{|A_l|+|A_r|}$ (the number of instances in the left subset divided by the total number of instances in $A$), and similarly for $p_{\rm right}$.
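As an illustration (a minimal sketch with made-up class labels, not code from the original post), the Gini gain above can be computed directly from the class labels that fall into the left and right children:

```python
from collections import Counter

def gini(labels):
    """Gini impurity computed from the class labels (not from sample sizes)."""
    n = len(labels)
    return 1 - sum((count / n) ** 2 for count in Counter(labels).values())

def gini_gain(left_labels, right_labels):
    """I(A) - p_left * I(A_l) - p_right * I(A_r) for a binary split."""
    parent = left_labels + right_labels
    p_left = len(left_labels) / len(parent)
    p_right = len(right_labels) / len(parent)
    return gini(parent) - p_left * gini(left_labels) - p_right * gini(right_labels)

# Hypothetical class labels for the 100/10 split from the question:
left = ['+'] * 90 + ['-'] * 10   # left child: mostly class c+
right = ['+'] * 2 + ['-'] * 8    # right child: mostly class c-
print(gini_gain(left, right))    # ~0.08, a non-negative Gini gain
```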

I feel the notation could still be improved; I will revisit it when I have more time.

Conclusion

Using only the number of data points is not enough; impurity measures how well one feature (the test feature) is able to reproduce the distribution of another feature (the class feature). The test feature's distribution produces the numbers you used (how many instances go left, how many go right), but the distribution of the class feature is not used in your formulas.

Later edit - proof of why it decreases

I noticed that I missed the part which proves why the weighted Gini index of the child nodes is always less than the Gini index of the parent node. I do not have a complete or verified proof, but I believe the following argument is valid. For other interesting results related to this topic you might check Technical Note: Some Properties of Splitting Criteria by Leo Breiman. My proof follows.

Suppose that we are in the binary case, and all the values in a node can be completely described by a pair $(a,b)$, meaning $a$ instances of the first class and $b$ instances of the second class. We can then state that the parent node has $(a,b)$ instances.

In order to find the best split we sort the instances according to a test feature and try all possible binary splits. Sorting by a given feature actually produces a permutation of the instances, which starts either with an instance of the first class or with an instance of the second class. Without loss of generality, we will suppose that it starts with an instance of the first class (if not, we have a mirror proof with the same calculation).

The first split to try puts $(1,0)$ instances in the left node and $(a-1,b)$ instances in the right node. How do the Gini indices of these candidate left and right child nodes compare with the parent node? On the left we obviously have $h({\rm left}) = 1 - (1/1)^2 - (0/1)^2 = 0$, so the left side has a smaller Gini index. How about the right node?

$$h(parent) = 1 - (\frac{a}{a+b})^2 - (\frac{b}{a+b})^2$$ $$h(right) = 1 - (\frac{a-1}{(a-1)+b})^2 - (\frac{b}{(a-1)+b})^2$$

Considering that $a$ is at least $1$ (otherwise we could not put an instance of the first class into the left node), we can compare the weighted impurity of the two children with the parent. Note that the right node alone can actually have a larger Gini index than the parent (for example $(a,b)=(9,1)$ gives $h({\rm parent})=0.18$ and $h({\rm right})=16/81\approx 0.198$), so it is the weighted sum that matters. Writing $h({\rm parent}) = \frac{2ab}{(a+b)^2}$, the weighted child impurity is $$\frac{1}{a+b}\,h({\rm left}) + \frac{a+b-1}{a+b}\,h({\rm right}) = \frac{2(a-1)b}{(a+b)(a+b-1)},$$ and after simplification it is easy to see that this is strictly smaller than $h({\rm parent})$ whenever $b>0$ (if $b=0$ the parent node is already pure and both values are $0$).

Now the final stage of the proof is to note that, while considering all the possible split points dictated by the data we have, we keep the one with the smallest aggregated (weighted) Gini index, which means that the optimum we choose is less than or equal to the trivial split I proved above to be smaller than the parent. This shows that in the end the weighted Gini index of the chosen split will be smaller than that of the parent.
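As a quick sanity check of this argument (a sketch added for illustration, not part of the original proof), one can enumerate every split position of every ordering of a small label sample and verify that the best weighted child Gini never exceeds the parent's:

```python
from itertools import permutations

def gini(labels):
    n = len(labels)
    return 1 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_weighted_child_gini(labels):
    """Smallest weighted child impurity over all binary split positions."""
    n = len(labels)
    return min(k / n * gini(labels[:k]) + (n - k) / n * gini(labels[k:])
               for k in range(1, n))

# Try every ordering (i.e. every way a feature could sort the instances)
# of a small sample with 3 instances of one class and 2 of the other.
sample = ['a'] * 3 + ['b'] * 2
for order in set(permutations(sample)):
    labels = list(order)
    assert best_weighted_child_gini(labels) <= gini(labels) + 1e-12
print("The chosen split's weighted Gini never exceeds the parent's Gini.")
```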

As a final conclusion, note that even if other candidate splits could give values larger than the parent node's, the one we choose will be the smallest among them and therefore also smaller than the parent's Gini index value.

Hope it helps.