Random Forest Predictors – Relative Importance of Predictors in Random Forest Classification in R

classification, machine learning, r, random forest

I'd like to determine the relative importance of sets of variables in a randomForest classification model in R. The importance function provides the MeanDecreaseGini metric for each individual predictor. Is it as simple as summing this across the predictors in a set?

For example:

# Assumes df has variables a1, a2, b1, b2, and outcome
rf <- randomForest(outcome ~ ., data=df)
importance(rf)
# To determine whether the "a" predictors are more important than the "b"s,
# can I sum the MeanDecreaseGini for a1 and a2 and compare to that of b1+b2?

Best Answer

First I would like to clarify what the importance metric actually measures.

MeanDecreaseGini is a measure of variable importance based on the Gini impurity index used to choose splits during training. A common misconception is that this importance metric refers to the Gini coefficient used for assessing model performance, which is closely related to AUC, but this is wrong. Here is the explanation from the randomForest package written by Breiman and Cutler:

Gini importance
Every time a split of a node is made on variable m the gini impurity criterion for the two descendent nodes is less than the parent node. Adding up the gini decreases for each individual variable over all trees in the forest gives a fast variable importance that is often very consistent with the permutation importance measure.

The Gini impurity index is defined as $$ G = \sum_{i=1}^{n_c} p_i(1-p_i) $$ where $n_c$ is the number of classes in the target variable and $p_i$ is the proportion of observations in the node that belong to class $i$.

For a two-class problem, this results in a curve that is maximized for a 50-50 sample and minimized for homogeneous sets.

[Figure: Gini impurity for 2 classes]
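To make the definition concrete, here is a minimal sketch in base R that evaluates the impurity on a small grid of two-class proportions. The function name gini_impurity and the grid are my own choices for illustration; they are not part of the randomForest package.

```r
# Gini impurity for a vector of class proportions (assumed to sum to 1).
gini_impurity <- function(p) {
  sum(p * (1 - p))
}

# Two-class case: impurity as a function of the share p of the first class.
p_grid <- seq(0, 1, by = 0.25)
two_class <- sapply(p_grid, function(p) gini_impurity(c(p, 1 - p)))

print(data.frame(p = p_grid, gini = two_class))
```

The impurity is 0 at p = 0 and p = 1 (homogeneous nodes) and peaks at 0.5 when p = 0.5, which is the shape of the curve described above.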

The importance is then calculated as $$ I = G_{parent} - G_{split1} - G_{split2} $$ averaged over all splits in the forest involving the predictor in question. As this is an average it could easily be extended to be averaged over all splits on variables contained in a group.
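As a rough illustration of the decrease for a single split, the sketch below computes it in base R for one hand-picked split of the iris data. It assumes, as is the usual convention for CART-style trees, that each child impurity is weighted by its share of the parent's observations (the formula above leaves any such weights implicit). The helper names gini_labels and gini_decrease and the threshold 2.5 are hypothetical and chosen only for this example.

```r
# Gini impurity computed from a vector of class labels.
gini_labels <- function(y) {
  p <- table(y) / length(y)
  sum(p * (1 - p))
}

# Hypothetical helper: impurity decrease for a binary split, where `left`
# is a logical vector marking the observations sent to the left child.
# ASSUMPTION: children are weighted by their share of the parent node.
gini_decrease <- function(y, left) {
  w_left <- mean(left)
  gini_labels(y) -
    w_left * gini_labels(y[left]) -
    (1 - w_left) * gini_labels(y[!left])
}

data(iris)
decrease <- gini_decrease(iris$Species, iris$Petal.Length < 2.5)
print(decrease)
```

This split isolates one species completely, so the left child is pure and the right child is a 50-50 mix of the remaining two species.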

Looking closer, each variable importance is an average conditional on the variable used, so the MeanDecreaseGini of the group is just the mean of these importances, weighted by how often each variable is used in the forest relative to the other variables in the same group. This holds because of the tower property $$ \mathbb{E}[\mathbb{E}[X|Y]] = \mathbb{E}[X] $$
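A small numeric check of this argument, using made-up per-split decreases rather than output from a real forest (the variable names a1 and a2 and all the numbers are purely illustrative):

```r
# Made-up Gini decreases, one row per split, labelled by the variable used.
splits <- data.frame(
  variable = c("a1", "a1", "a1", "a2"),
  decrease = c(0.30, 0.10, 0.20, 0.40)
)

# Conditional means E[X | Y]: average decrease per variable.
per_var_mean <- tapply(splits$decrease, splits$variable, mean)

# P(Y): share of the group's splits that use each variable.
share <- table(splits$variable) / nrow(splits)

# Weighted average of the per-variable means ...
weighted_group_mean <- sum(as.numeric(per_var_mean) *
                           as.numeric(share[names(per_var_mean)]))

# ... equals the plain average over all splits in the group, E[X].
overall_mean <- mean(splits$decrease)

print(c(weighted = weighted_group_mean, overall = overall_mean,
        naive_sum = sum(per_var_mean)))
```

Both the weighted average and the overall mean come out to 0.25, while naively summing the two per-variable means gives 0.6.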

Now, to answer your question directly: it is not as simple as summing the importances in each group to get the combined MeanDecreaseGini, but computing the weighted average will get you the answer you are looking for. We just need to find the variable frequencies within each group.

Here is a simple script to get these from a random forest object in R:

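var.share <- function(rf.obj, members) {
  # Number of splits made on each predictor. bestvar is 0 for terminal nodes,
  # which tabulate() ignores; predictors that are never used get a count of 0.
  count <- tabulate(rf.obj$forest$bestvar, nbins = length(rf.obj$forest$ncat))
  names(count) <- names(rf.obj$forest$ncat)
  # Share of the group's splits that use each member variable
  share <- count[members] / sum(count[members])
  return(share)
}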

Just pass in the names of the variables in the group as the members parameter.

I hope this answers your question. I can write up a function to get the group importances directly if it is of interest.

EDIT:
Here is a function that gives the group importance given a randomForest object and a list of vectors with variable names. It uses var.share as previously defined. I have not done any input checking so you need to make sure you use the right variable names.

group.importance <- function(rf.obj, groups) {
  var.imp <- as.matrix(sapply(groups, function(g) {
    sum(importance(rf.obj, 2)[g, ]*var.share(rf.obj, g))
  }))
  colnames(var.imp) <- "MeanDecreaseGini"
  return(var.imp)
}

Example of usage:

library(randomForest)                                                          
data(iris)

rf.obj <- randomForest(Species ~ ., data=iris)

groups <- list(Sepal=c("Sepal.Width", "Sepal.Length"), 
               Petal=c("Petal.Width", "Petal.Length"))

group.importance(rf.obj, groups)

      MeanDecreaseGini
Sepal         6.187198
Petal        43.913020

It also works for overlapping groups:

overlapping.groups <- list(Sepal=c("Sepal.Width", "Sepal.Length"), 
                           Petal=c("Petal.Width", "Petal.Length"),
                           Width=c("Sepal.Width", "Petal.Width"), 
                           Length=c("Sepal.Length", "Petal.Length"))

group.importance(rf.obj, overlapping.groups)

       MeanDecreaseGini
Sepal          6.187198
Petal         43.913020
Width         30.513776
Length        30.386706

Related Solutions

Solved – Relative variable importance with AIC

This is some further advice/discussion I was given:

AIC relative importance weights (RIW) can only be calculated from a balanced candidate model set. If you have 3 variables (e.g. repro, time & WR) then the balanced set (without interactions) is

repro
time
WR
repro + time
repro + WR
time + WR
repro + time + WR
intercept only

The number of models in the set is 2 to the power of the number of explanatory variables (in this case $2^3 = 8$). With 2-way interactions, your candidate model set ALSO includes the following (i.e. in addition to those above):

repro + time + repro*time
repro + WR + repro*WR
time + WR + time*WR
repro + time + WR + repro*time
repro + time + WR + repro*WR
repro + time + WR + time*WR

If you want the 3-way interaction, then you would ALSO add this to all of the models described above.

Each variable's relative importance weight is then the SUM of ALL AIC-weights from models that contain that variable. Because AIC-weights are standardized to sum to one within a candidate model set, the RIW for each variable can range from 0 to 1.
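Here is a minimal sketch of that calculation in base R for a balanced set without interactions. The mtcars data, the three predictors and the use of lm are stand-ins I picked for illustration; they have nothing to do with the repro/time/WR example.

```r
data(mtcars)
vars <- c("wt", "hp", "qsec")   # illustrative predictors, not from the post

# All 2^3 = 8 subsets of the predictors, one row per model (TRUE = included).
inclusion <- as.matrix(expand.grid(rep(list(c(FALSE, TRUE)), length(vars))))
colnames(inclusion) <- vars

# AIC of each candidate model; the empty subset is the intercept-only model.
aic <- apply(inclusion, 1, function(inc) {
  rhs <- if (any(inc)) paste(vars[inc], collapse = " + ") else "1"
  AIC(lm(as.formula(paste("mpg ~", rhs)), data = mtcars))
})

# AIC weights, standardized to sum to one over the candidate set.
delta <- aic - min(aic)
akaike_weight <- exp(-0.5 * delta) / sum(exp(-0.5 * delta))

# RIW: sum of the weights of all models that contain each variable.
riw <- colSums(inclusion * akaike_weight)
print(round(riw, 3))
```

Each variable appears in exactly 4 of the 8 models, which is what makes the set balanced and the sums comparable.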

Do not divide the result by the number of models it is contained in – it is the total sum. I would only use these for balanced candidate model sets; I wouldn’t use RIW for a smaller number of models.

NOTE that if you include interactions, then you can only compare the RIWs of main effects with each other, and you can only compare the RIWs of interactions with each other. You cannot compare main effect RIWs with interaction RIWs (because main effects are present in more models than interactions).

FYI: a strong explanatory variable will have an RIW of around 0.9 or more, moderate effects around 0.6-0.9, and very weak effects around 0.5-0.6; below that, forget about it. For interactions, a strong effect could be >0.7, moderate >0.5. If you’re not using RIWs, then simply look at your model table and see if you get consistent improvements in AIC when you add specific variables, and by how much. Strong effects will often give you improvements in AIC of >5, moderate 2-5 and weak 0-2. If you don’t get an improvement at all, then it isn’t explaining anything.

If you don’t have a balanced candidate set, but DO have the AIC weights (which it appears you do), then you can simply use the ratios of these to determine the strength of support for one model over another. E.g. if you have model 1 with an AIC-weight of 0.7 and model 2 with an AIC-weight of 0.15, then model 1 has about 4.7 times more support from the data than model 2 (0.7/0.15). You can use this to assess the relative strength of variables as they go in and out of models. But you don’t NEED to do these calculations, and can simply refer the reader to the table, especially if you have a dominant model or a series of models at the top that all contain a particular variable. Then it is simply obvious to everyone that it is important.
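The ratio can be computed for a whole model table at once. A tiny sketch, where the third weight and the model names are invented to fill out the example:

```r
# AIC weights from a model table (0.70 and 0.15 are from the example above;
# the third value is made up so that the weights sum to one).
akaike_weight <- c(model1 = 0.70, model2 = 0.15, model3 = 0.15)

# Support for the best model relative to each model in the table.
evidence_ratio <- max(akaike_weight) / akaike_weight
print(round(evidence_ratio, 2))
```

For model 2 this gives 0.70 / 0.15, i.e. roughly 4.67.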

Cox Regression Analysis – Determining Relative Importance of Variables

Thanks for trying those functions. I believe that both metrics you mentioned are excellent in this context. This is useful for any model that gives rise to Wald statistics (which is virtually all models) although likelihood ratio $\chi^2$ statistics would be even better (but more tedious to compute).

You can use the bootstrap to get confidence intervals for the ranks of variables computed these ways. For example code, type ?anova.rms in R.
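To show the idea without the rms package, here is a rough base-R sketch that bootstraps the ranks of per-variable statistics. It is not the example from ?anova.rms: it uses a linear model on mtcars and squared t statistics as stand-ins for 1-df Wald statistics, and the helper names, the model and the number of resamples are all my own assumptions.

```r
set.seed(1)
data(mtcars)
form <- mpg ~ wt + hp + qsec   # illustrative model, not from the post

# Squared t statistic for every non-intercept coefficient
# (a stand-in for a 1-df Wald chi-square).
wald_stat <- function(d) {
  est <- coef(summary(lm(form, data = d)))
  (est[-1, "Estimate"] / est[-1, "Std. Error"])^2
}

# Rank 1 = most important (largest statistic).
rank_desc <- function(x) rank(-x)

# Refit on resampled rows and record the ranks each time.
boot_ranks <- replicate(500, {
  idx <- sample(nrow(mtcars), replace = TRUE)
  rank_desc(wald_stat(mtcars[idx, ]))
})

observed <- rank_desc(wald_stat(mtcars))
ci <- t(apply(boot_ranks, 1, quantile, probs = c(0.025, 0.975)))
print(cbind(observed, ci))
```

Wide intervals on the ranks are a sign that the data cannot reliably order the variables.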

All this is related to the "adequacy index". Two papers using the approach that have appeared in the medical literature are http://www.citeulike.org/user/harrelfe/article/13265566 and http://www.citeulike.org/user/harrelfe/article/13263849 .
