Solved – How to measure/rank “variable importance” when using CART? (specifically using {rpart} from R)

cart, classification, model-selection, r, rpart

When building a CART model (specifically a classification tree) with rpart in R, it is often interesting to know the importance of the various variables introduced to the model.

Thus, my question is: what common measures exist for ranking/measuring the importance of the variables participating in a CART model? And how can this be computed in R (for example, when using the rpart package)?

For example, here is some dummy code, created so you might show your solutions on it. This example is structured so that it is clear that variables x1 and x2 are "important" while (in some sense) x1 is more important than x2 (since x1 applies to more cases and thus has more influence on the structure of the data than x2).

set.seed(31431)
n <- 400
x1 <- rnorm(n)
x2 <- rnorm(n)
x3 <- rnorm(n)
x4 <- rnorm(n)
x5 <- rnorm(n)

X <- data.frame(x1,x2,x3,x4,x5)

y <- sample(letters[1:4], n, replace = TRUE)
y <- ifelse(X[,2] < -1 , "b", y)
y <- ifelse(X[,1] < 0 , "a", y)

require(rpart)
fit <- rpart(y ~ ., data = X)
plot(fit); text(fit)

info.gain.rpart(fit) # your function - reporting how important each variable is

(references are always welcomed)

Best Answer

Variable importance can generally be computed from the reduction in predictive accuracy when the predictor of interest is removed (with a permutation technique, as in Random Forests), or from some measure of the decrease in node impurity; see (1) for an overview of available methods. An obvious alternative to CART is, of course, RF (randomForest, but see also party). With RF, the Gini importance index is defined as the average decrease in Gini node impurity over all trees in the forest (it follows from the fact that the Gini impurity index for a given parent node is larger than the value of that measure for its two daughter nodes, see e.g. (2)).
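As a sketch of both measures in practice on the question's simulated data (randomForest reports the permutation and the Gini importance side by side when importance = TRUE; the seed 42 for the forest is arbitrary):

```r
library(randomForest)

## Recreate the question's simulated data
set.seed(31431)
n <- 400
X <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n),
                x4 = rnorm(n), x5 = rnorm(n))
y <- sample(letters[1:4], n, replace = TRUE)
y <- ifelse(X$x2 < -1, "b", y)
y <- ifelse(X$x1 < 0,  "a", y)

set.seed(42)
rf <- randomForest(X, factor(y), importance = TRUE)

## One row per predictor: class-wise permutation importances,
## MeanDecreaseAccuracy (permutation) and MeanDecreaseGini (impurity)
importance(rf)
varImpPlot(rf)
```

With this data-generating process, x1 and x2 should clearly dominate x3-x5 on both columns.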

I know that Carolin Strobl and colleagues have contributed a lot of simulation and experimental studies on (conditional) variable importance in RFs and CARTs (e.g., (3-4), among many others, or her thesis, Statistical Issues in Machine Learning – Towards Reliable Split Selection and Variable Importance Measures).

To my knowledge, the caret package (5) only considers a loss function for the regression case (i.e., mean squared error). Maybe classification will be added in the near future (in any case, an example of a classification case using k-NN is available in the on-line help for dotPlot).
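For completeness, later releases of caret do expose a model-specific importance for classification trees through varImp(); for rpart objects it is based on the contribution of each variable to the splits. A minimal sketch on the question's data:

```r
library(rpart)
library(caret)

## Recreate the question's simulated data
set.seed(31431)
n <- 400
X <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n),
                x4 = rnorm(n), x5 = rnorm(n))
y <- sample(letters[1:4], n, replace = TRUE)
y <- ifelse(X$x2 < -1, "b", y)
y <- ifelse(X$x1 < 0,  "a", y)

fit <- rpart(y ~ ., data = X)

## Data frame with one "Overall" score per predictor
varImp(fit)
```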

However, Noel M. O'Boyle seems to have some R code for variable importance in CART.
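Note also that more recent versions of rpart store an impurity-based importance directly in the fitted object: the variable.importance component (documented in ?rpart.object) sums the goodness-of-split measures for each variable over primary and surrogate splits, and a scaled version is printed by summary(). On the question's data:

```r
library(rpart)

## Recreate the question's simulated data
set.seed(31431)
n <- 400
X <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n),
                x4 = rnorm(n), x5 = rnorm(n))
y <- sample(letters[1:4], n, replace = TRUE)
y <- ifelse(X$x2 < -1, "b", y)
y <- ifelse(X$x1 < 0,  "a", y)

fit <- rpart(y ~ ., data = X)

## Named numeric vector, sorted in decreasing order of importance;
## variables never used in any split are simply absent
fit$variable.importance
```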

References

  1. Sandri and Zuccolotto (2008). A bias correction algorithm for the Gini variable importance measure in classification trees.
  2. Izenman (2008). Modern Multivariate Statistical Techniques. Springer.
  3. Strobl, Hothorn, and Zeileis (2009). Party on! The R Journal, 1(2).
  4. Strobl, Boulesteix, Kneib, Augustin, and Zeileis (2008). Conditional variable importance for random forests. BMC Bioinformatics, 9:307.
  5. Kuhn (2008). Building Predictive Models in R Using the caret Package. Journal of Statistical Software, 28(5).