Relationship between Gini Importance and Prediction Performance (say AUC)

feature selection, random forest

I want to use the decrease in Gini impurity to rank features for my random forest classifier. I understand that the decrease in Gini impurity at one node is calculated as:

$$ \Delta i(n) = i(n) - p_l \, i(n_l) - p_r \, i(n_r) $$
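As a quick sanity check of the formula, here is a minimal sketch in plain Python. The function names `gini` and `gini_decrease` and the toy label lists are made up for illustration; $p_l$ and $p_r$ are taken to be the fractions of the parent node's samples sent to the left and right child.

```python
def gini(labels):
    """Gini impurity i(n) = 1 - sum_c p_c^2 for the class labels in a node."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))


def gini_decrease(parent, left, right):
    """Delta i(n) = i(n) - p_l * i(n_l) - p_r * i(n_r)."""
    p_l = len(left) / len(parent)   # fraction of samples going left
    p_r = len(right) / len(parent)  # fraction of samples going right
    return gini(parent) - p_l * gini(left) - p_r * gini(right)


# Hypothetical node with 8 samples, split into two children of 4 samples each.
parent = [0, 0, 0, 0, 1, 1, 1, 1]
weak_split = gini_decrease(parent, [0, 0, 0, 1], [0, 1, 1, 1])
pure_split = gini_decrease(parent, [0, 0, 0, 0], [1, 1, 1, 1])

print(weak_split)  # 0.125
print(pure_split)  # 0.5
```

The value depends only on how class proportions change inside the training node, which is part of why it does not translate directly into a held-out metric such as AUC.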

The overall decrease in Gini impurity for a given feature is then summed over all nodes that split on that feature, across all trees. What I don't quite understand is whether there is a link between the decrease in Gini impurity and prediction performance. That is, the Gini decrease says which features are more important relative to others, but can I deduce from it how much an individual feature will affect prediction performance? I have read the following posts:

Gini decrease and Gini impurity of children nodes

What is the relationship between the GINI score and the log-likelihood ratio

Best Answer

I'm not sure there is a great answer to this question, but maybe this helps.

As far as I know, there were four measures of variable importance in the original Breiman paper, with only two making it into the randomForest package. Of these, the permutation variable importance appears to be much more popular, likely because it is easier to see how it produces intuitive measures of variable importance and how those measures relate to the predictive ability of the model. My understanding is that varSelRF, Boruta, and the conditional variable importance in party all use the permutation variable importance.
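To make the link to predictive ability concrete, here is a rough sketch of permutation importance measured directly as a drop in AUC. It is plain Python and not the randomForest implementation: `score_fn` is a hypothetical stand-in for a fitted model's predicted score, the toy data are simulated, and the helper names are made up for illustration.

```python
import random


def auc(labels, scores):
    """AUC as the probability that a random positive outscores a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def permutation_auc_drop(score_fn, X, y, column, n_repeats=20, seed=0):
    """Mean decrease in AUC after shuffling one feature column."""
    rng = random.Random(seed)
    baseline = auc(y, [score_fn(row) for row in X])
    drops = []
    for _ in range(n_repeats):
        shuffled = [row[column] for row in X]
        rng.shuffle(shuffled)
        # Rebuild each row with the shuffled value in place of the original one.
        X_perm = [row[:column] + [v] + row[column + 1:] for row, v in zip(X, shuffled)]
        drops.append(baseline - auc(y, [score_fn(row) for row in X_perm]))
    return sum(drops) / n_repeats


# Simulated data: feature 0 carries signal, feature 1 is pure noise.
rng = random.Random(1)
y = [i % 2 for i in range(200)]
X = [[label + rng.gauss(0, 1), rng.gauss(0, 1)] for label in y]


def score_fn(row):
    # Assumed "model": it scores on feature 0 only and ignores feature 1.
    return row[0]


drop_informative = permutation_auc_drop(score_fn, X, y, column=0)
drop_noise = permutation_auc_drop(score_fn, X, y, column=1)
print(drop_informative, drop_noise)
```

Unlike the Gini decrease, the result here is in the units of the metric itself, so it can be read as "shuffling this feature costs about this much AUC". In practice you would compute it on held-out data with the real model; scikit-learn, for instance, provides `sklearn.inspection.permutation_importance`, which accepts `scoring="roc_auc"`.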

Variable importance measures are suggestive, but it is hard to make inferences from their output. That might be why there are so many ways of calculating variable importance; the relaimpo package alone has six measures. Both the Boruta and relaimpo package vignettes are worth reading, as they discuss the subject at length. The relaimpo vignette especially emphasizes the limitations of the methods and the conflicting results you can obtain. The methods may be efficient at finding all-relevant features, but they often produce conflicting results when ranking features. (You'll find the values also change when you change the cost function.)
