Variable importance in R randomForest package

I have a few questions regarding the variable importance in random forest. The importance function outputs two types of importance measures (1 = mean decrease in accuracy, 2 = mean decrease in node impurity). For the 2nd measure, the manual says:

The second measure is the total decrease in node impurities from splitting on the variable, averaged over all trees.

  1. Does “over all trees” actually mean “over all trees where that predictor is used as a splitter”?

  2. At each split, which criterion is used to choose the predictor to split on? Could a predictor be used as a splitter more than once in the same tree?

  3. Is it guaranteed that each predictor gets at least one chance to be used as a splitter while the forest is built? If not, what would that predictor’s importance value be?

Best Answer

  1. Does “over all trees” actually mean “over all trees where that predictor is used as a splitter”?

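My understanding is that the sum runs over all nodes, in every tree, where the predictor is used as the splitter, and that this total is then divided by the total number of trees in the forest, not only by the number of trees that use the predictor. A tree that never splits on the predictor simply contributes zero to the sum. Note that a predictor can split several nodes in a given tree, or none at all.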
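To make the averaging concrete, here is a minimal Python sketch of that bookkeeping. It is not the randomForest source code: the forest layout (a list of trees, each a list of (predictor, impurity decrease) pairs) and the function name are hypothetical, chosen only for illustration.

```python
from collections import defaultdict

def mean_decrease_in_impurity(forest, predictors):
    """Sum impurity decreases per predictor over all split nodes,
    then divide by the total number of trees.

    forest: list of trees; each tree is a list of
            (predictor, impurity_decrease) pairs, one per split node.
            This layout is an assumption made for the example.
    """
    totals = defaultdict(float)
    for tree in forest:
        for predictor, decrease in tree:
            totals[predictor] += decrease
    n_trees = len(forest)
    # Predictors that never split a node keep a total of 0.0.
    return {p: totals[p] / n_trees for p in predictors}

forest = [
    [("x1", 4.0), ("x2", 1.0), ("x1", 2.0)],  # x1 splits two nodes in this tree
    [("x2", 3.0)],                             # x1 is unused here and adds nothing
]
importance = mean_decrease_in_impurity(forest, ["x1", "x2", "x3"])
print(importance)  # {'x1': 3.0, 'x2': 2.0, 'x3': 0.0}
```

Here x1 is divided by 2 trees even though only one tree uses it, and x3, which is never used, ends up with an importance of zero.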

  2. At each split, which criterion is used to choose the predictor to split on? Could a predictor be used as a splitter more than once in the same tree?

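My understanding is that at each node a random subset of the predictors is drawn (its size is set by the mtry argument), and the predictor in that subset whose best split reduces the impurity the most is chosen. Because a new subset is drawn at every node, the same predictor can be used as a splitter more than once in the same tree. "Impurity" can be quantified by different measures; two that I have come across are the Gini impurity and the entropy. According to the randomForest manual, the package measures node impurity by the Gini index for classification and by the residual sum of squares for regression.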
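As a rough illustration of that selection step, the sketch below computes the Gini impurity decrease of a split and picks the best predictor from a randomly drawn candidate subset. It is a simplified sketch for a classification problem with numeric predictors, not the package's actual implementation; the function names, the toy data, and the exhaustive threshold search are all assumptions.

```python
import random
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def impurity_decrease(parent, left, right):
    """Parent impurity minus the size-weighted impurity of the children."""
    n = len(parent)
    return (gini(parent)
            - (len(left) / n) * gini(left)
            - (len(right) / n) * gini(right))

def best_split(rows, labels, candidates):
    """Return (predictor, threshold, decrease) for the best split
    among the candidate predictors only."""
    best = (None, None, 0.0)
    for var in candidates:
        for threshold in sorted({r[var] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[var] <= threshold]
            right = [y for r, y in zip(rows, labels) if r[var] > threshold]
            if not left or not right:
                continue  # not a real split
            dec = impurity_decrease(labels, left, right)
            if dec > best[2]:
                best = (var, threshold, dec)
    return best

# Toy data: x1 separates the classes perfectly, x2 does not.
rows = [{"x1": 1, "x2": 5}, {"x1": 2, "x2": 1},
        {"x1": 8, "x2": 4}, {"x1": 9, "x2": 2}]
labels = ["a", "a", "b", "b"]

# With both predictors as candidates, x1 wins.
print(best_split(rows, labels, ["x1", "x2"]))  # ('x1', 2, 0.5)

# In a forest, only a random subset of size mtry is considered at each node.
mtry = 1
candidates = random.sample(["x1", "x2"], mtry)
print(candidates, best_split(rows, labels, candidates))
```

The last call shows why a weaker predictor such as x2 can still be chosen: when x1 is not drawn into the candidate subset, the split is made on the best of what is available.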

  3. Is it guaranteed that each predictor gets at least one chance to be used as a splitter while the forest is built? If not, what would that predictor’s importance value be?

I believe there is a chance that a predictor is never used, although that chance is very small if you build a lot of trees. In that case, its type 2 importance is zero.
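To get a feel for how small that chance is, here is a back-of-the-envelope Python sketch. It assumes that at every split mtry of the p predictors are drawn uniformly without replacement, independently of other splits, so a given predictor is a candidate with probability mtry / p. The function name and the numbers are made up for illustration. Being a candidate does not guarantee being chosen, so this only bounds from below the chance of never being used.

```python
def prob_never_candidate(p, mtry, n_splits):
    """Probability that one given predictor is never drawn into the
    candidate subset in any of n_splits splits, assuming mtry of p
    predictors are sampled uniformly at each split, independently."""
    return (1 - mtry / p) ** n_splits

# Hypothetical example: 100 predictors, mtry = 10, 50 splits in total.
print(prob_never_candidate(p=100, mtry=10, n_splits=50))      # about 0.005

# The same predictor over 500 trees with 50 splits each.
print(prob_never_candidate(p=100, mtry=10, n_splits=50 * 500))
```

The probability shrinks geometrically with the total number of splits, which is why a predictor with no chance at all becomes very unlikely once the forest has many trees.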