Deviance and GLM
Formally, one can view the deviance as a sort of distance between two probabilistic models; in the GLM context, it amounts to twice the log-likelihood ratio between two nested models, $2\log(\ell_1/\ell_0)$, where $\ell_0$ is the likelihood of the "smaller" model, i.e. the one obtained by placing a linear restriction on the model parameters (cf. the Neyman–Pearson lemma), as @suncoolsu said. As such, it can be used to perform model comparison. It can also be seen as a generalization of the RSS used in OLS estimation (ANOVA, regression), for it provides a measure of the goodness of fit of the model being evaluated when compared to the null model (intercept only). It works with an LM too:
> x <- rnorm(100)
> y <- 0.8*x+rnorm(100)
> lm.res <- lm(y ~ x)
The residuals SS (RSS) is computed as $\hat\varepsilon^t\hat\varepsilon$, which is readily obtained as:
> t(residuals(lm.res))%*%residuals(lm.res)
          [,1]
[1,] 98.66754
or from the (unadjusted) $R^2$
> summary(lm.res)
Call:
lm(formula = y ~ x)
(...)
Residual standard error: 1.003 on 98 degrees of freedom
Multiple R-squared: 0.4234, Adjusted R-squared: 0.4175
F-statistic: 71.97 on 1 and 98 DF, p-value: 2.334e-13
since $R^2=1-\text{RSS}/\text{TSS}$, where $\text{TSS}$ is the total sum of squares. Note that the RSS is directly available in an ANOVA table, like
> summary.aov(lm.res)
            Df Sum Sq Mean Sq F value    Pr(>F)    
x            1 72.459  72.459  71.969 2.334e-13 ***
Residuals   98 98.668   1.007
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
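As a quick check of the identity $R^2=1-\text{RSS}/\text{TSS}$, here is the same fit reproduced with a fixed seed (the original snippet did not set one, so the exact numbers differ):

```r
set.seed(101)                        # arbitrary seed, for reproducibility
x <- rnorm(100)
y <- 0.8 * x + rnorm(100)
lm.res <- lm(y ~ x)
rss <- sum(residuals(lm.res)^2)      # residual sum of squares
tss <- sum((y - mean(y))^2)          # total sum of squares
r2  <- 1 - rss / tss
all.equal(r2, summary(lm.res)$r.squared)   # TRUE
```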
Now, look at the deviance:
> deviance(lm.res)
[1] 98.66754
In fact, for linear models the deviance equals the RSS (you may recall that OLS and ML estimates coincide in such a case).
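Both facts are easy to verify numerically: refitting the LM as a Gaussian GLM gives the same deviance (the RSS), and for a genuinely non-Gaussian GLM the deviance difference between two nested models is twice the log-likelihood ratio. The models below are illustrative simulations, not taken from the text above:

```r
set.seed(101)
x <- rnorm(100)
y <- 0.8 * x + rnorm(100)
## Same linear fit expressed as a Gaussian GLM: the deviance is the RSS
all.equal(deviance(glm(y ~ x, family = gaussian)), deviance(lm(y ~ x)))

## Nested logistic models: deviance difference = 2 * log-likelihood ratio
z  <- rbinom(100, 1, plogis(0.5 + x))
m0 <- glm(z ~ 1, family = binomial)   # "smaller" (null) model
m1 <- glm(z ~ x, family = binomial)   # adds one parameter
all.equal(deviance(m0) - deviance(m1),
          as.numeric(2 * (logLik(m1) - logLik(m0))))
anova(m0, m1, test = "Chisq")         # likelihood-ratio (deviance) test
```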
Deviance and CART
We can see CART as a way to allocate $n$ already labeled individuals to arbitrary classes (in a classification context). A tree can be viewed as providing a probability model for individuals' class membership. At each node $i$, we thus have a probability distribution $p_{ik}$ over the classes. What is important here is that the leaves of the tree give us a random sample $n_{ik}$ from a multinomial distribution specified by the $p_{ik}$. We can thus define the deviance of a tree, $D$, as the sum over all leaves of
$$D_i=-2\sum_kn_{ik}\log(p_{ik}),$$
following Venables and Ripley's notations (MASS, Springer 2002, 4th ed.). If you have access to this essential reference for R users (IMHO), you can check by yourself how such an approach is used for splitting nodes and fitting a tree to observed data (p. 255 ff.); basically, the idea is to minimize, by pruning the tree, $D+\alpha \#(T)$, where $\#(T)$ is the number of nodes in the tree $T$ — we recognize the usual cost-complexity trade-off. In this respect, $D$ is equivalent to other measures of node impurity (i.e., the heterogeneity of the class distribution at a given node), which are based on entropy or information gain, or on the well-known Gini index, defined as $1-\sum_kp_{ik}^2$ (the unknown proportions are estimated from the node proportions).
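With the class proportions estimated from the node counts, the node deviance is just $2n_i$ times the node entropy, so minimizing one amounts to minimizing the other. A small numerical sketch, with hypothetical counts $n_{ik}$ at a single node:

```r
## Hypothetical class counts n_ik at one node (not from the text above)
n_ik <- c(30, 15, 5)
p_ik <- n_ik / sum(n_ik)                # proportions estimated from the node
D_i     <- -2 * sum(n_ik * log(p_ik))   # node deviance
entropy <- -sum(p_ik * log(p_ik))       # entropy (in nats)
gini    <- 1 - sum(p_ik^2)              # Gini impurity index
## With p_ik = n_ik / n_i, the deviance reduces to 2 * n_i * entropy:
all.equal(D_i, 2 * sum(n_ik) * entropy)   # TRUE
```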
With a regression tree, the idea is quite similar, and we can conceptualize the deviance as a sum of squares, defined for the individuals $j$ in leaf $i$ by
$$D_i=\sum_j(y_j-\mu_i)^2,$$
summed over all leaves. Here, the probability model considered within each leaf is a Gaussian $\mathcal{N}(\mu_i,\sigma^2)$. Quoting Venables and Ripley (p. 256): "$D$ is the usual scaled deviance for a Gaussian GLM. However, the distribution at internal nodes of the tree is then a mixture of normal distributions, and so $D_i$ is only appropriate at the leaves. The tree-construction process has to be seen as a hierarchical refinement of probability models, very similar to forward variable selection in regression." Section 9.2 provides further detailed information about the rpart implementation, but you can already look at the residuals() function for rpart objects, where "deviance residuals" are computed as the square root of minus twice the logarithm of the fitted model.
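You can inspect these node deviances directly: rpart stores them in the `dev` column of the fitted object's `frame`, and for `method = "anova"` this is the within-node sum of squares, so at the root it equals the TSS of the response. A small simulated check (the data are made up for illustration):

```r
library(rpart)                       # recommended package shipped with R
set.seed(101)
x <- runif(120)
y <- sin(2 * pi * x) + rnorm(120, sd = 0.3)
fit <- rpart(y ~ x, method = "anova")
## First row of fit$frame is the root node; its deviance is the TSS of y
all.equal(fit$frame$dev[1], sum((y - mean(y))^2))   # TRUE
```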
An Introduction to Recursive Partitioning Using the rpart Routines, by Atkinson and Therneau, is also a good starting point. For a more general review (including bagging), I would recommend
Best Answer
Variable importance might generally be computed based on the reduction in predictive accuracy when the predictor of interest is removed (via a permutation technique, as in Random Forests) or on some measure of the decrease in node impurity, but see (1) for an overview of the available methods. An obvious alternative to CART is RF, of course (randomForest, but see also party). With RF, the Gini importance index is defined as the averaged Gini decrease in node impurity over all trees in the forest (it follows from the fact that the Gini impurity of a given parent node is larger than the weighted average of that measure over its two daughter nodes, see e.g. (2)).
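The parent-vs-daughters inequality behind the Gini importance can be checked by hand on a single split. The counts below are hypothetical, and the base-R sketch only computes one split's impurity decrease (the RF index averages such decreases over all splits and trees):

```r
## Gini impurity of a node, from its class counts
gini <- function(counts) { p <- counts / sum(counts); 1 - sum(p^2) }

parent <- c(40, 40)    # hypothetical class counts at the parent node
left   <- c(35, 10)    # daughter nodes after some candidate split
right  <- c(5, 30)     # (left + right must equal parent)

n <- sum(parent); nl <- sum(left); nr <- sum(right)
decrease <- gini(parent) - (nl / n) * gini(left) - (nr / n) * gini(right)
decrease   # strictly positive: each split reduces weighted impurity
```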
I know that Carolin Strobl and colleagues have contributed a lot of simulation and experimental studies on (conditional) variable importance in RFs and CART (e.g., (3)–(4), but there are many others); see also her thesis, Statistical Issues in Machine Learning – Towards Reliable Split Selection and Variable Importance Measures.
To my knowledge, the caret package (5) only considers a loss function for the regression case (i.e., mean squared error). Maybe this will be added in the near future (in any case, an example with a classification task using k-NN is available in the on-line help for dotPlot). However, Noel M. O'Boyle seems to have some R code for variable importance in CART.
References