Deviance and GLM
Formally, one can view the deviance as a sort of distance between two probabilistic models; in the GLM context, it amounts to twice the log-likelihood ratio between two nested models, $2\log(\ell_1/\ell_0)$, where $\ell_0$ is the likelihood of the "smaller" model, i.e. the one obtained by placing a linear restriction on the model parameters (cf. the Neyman–Pearson lemma), as @suncoolsu said. As such, it can be used to perform model comparison. It can also be seen as a generalization of the RSS used in OLS estimation (ANOVA, regression), for it provides a measure of the goodness of fit of the model being evaluated when compared to the null model (intercept only). It works with an LM too:
> x <- rnorm(100)
> y <- 0.8*x+rnorm(100)
> lm.res <- lm(y ~ x)
The residuals SS (RSS) is computed as $\hat\varepsilon^t\hat\varepsilon$, which is readily obtained as:
> t(residuals(lm.res))%*%residuals(lm.res)
          [,1]
[1,] 98.66754
or from the (unadjusted) $R^2$
> summary(lm.res)
Call:
lm(formula = y ~ x)
(...)
Residual standard error: 1.003 on 98 degrees of freedom
Multiple R-squared: 0.4234, Adjusted R-squared: 0.4175
F-statistic: 71.97 on 1 and 98 DF, p-value: 2.334e-13
since $R^2=1-\text{RSS}/\text{TSS}$, where $\text{TSS}$ is the total sum of squares. Note that the RSS is directly available in an ANOVA table, like
> summary.aov(lm.res)
            Df Sum Sq Mean Sq F value    Pr(>F)    
x            1 72.459  72.459  71.969 2.334e-13 ***
Residuals   98 98.668   1.007
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
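As a quick check of the identity $R^2=1-\text{RSS}/\text{TSS}$, here is the same fit reproduced with a fixed seed (the original snippet did not set one, so the exact numbers differ):

```r
set.seed(101)                        # arbitrary seed, for reproducibility
x <- rnorm(100)
y <- 0.8 * x + rnorm(100)
lm.res <- lm(y ~ x)
rss <- sum(residuals(lm.res)^2)      # residual sum of squares
tss <- sum((y - mean(y))^2)          # total sum of squares
r2  <- 1 - rss / tss
all.equal(r2, summary(lm.res)$r.squared)   # TRUE
```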
Now, look at the deviance:
> deviance(lm.res)
[1] 98.66754
In fact, for linear models the deviance equals the RSS (you may recall that OLS and ML estimates coincide in such a case).
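Both facts are easy to verify numerically: refitting the LM as a Gaussian GLM gives the same deviance (the RSS), and for a genuinely non-Gaussian GLM the deviance difference between two nested models is twice the log-likelihood ratio. The models below are illustrative simulations, not taken from the text above:

```r
set.seed(101)
x <- rnorm(100)
y <- 0.8 * x + rnorm(100)
## Same linear fit expressed as a Gaussian GLM: the deviance is the RSS
all.equal(deviance(glm(y ~ x, family = gaussian)), deviance(lm(y ~ x)))

## Nested logistic models: deviance difference = 2 * log-likelihood ratio
z  <- rbinom(100, 1, plogis(0.5 + x))
m0 <- glm(z ~ 1, family = binomial)   # "smaller" (null) model
m1 <- glm(z ~ x, family = binomial)   # adds one parameter
all.equal(deviance(m0) - deviance(m1),
          as.numeric(2 * (logLik(m1) - logLik(m0))))
anova(m0, m1, test = "Chisq")         # likelihood-ratio (deviance) test
```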
Deviance and CART
We can see CART as a way to allocate $n$ already labeled individuals to arbitrary classes (in a classification context). A tree can be viewed as providing a probability model for individuals' class membership. At each node $i$, we thus have a probability distribution $p_{ik}$ over the classes. What is important here is that the leaves of the tree give us a random sample $n_{ik}$ from a multinomial distribution specified by the $p_{ik}$. We can thus define the deviance of a tree, $D$, as the sum over all leaves of
$$D_i=-2\sum_kn_{ik}\log(p_{ik}),$$
following Venables and Ripley's notations (MASS, Springer 2002, 4th ed.). If you have access to this essential reference for R users (IMHO), you can check by yourself how such an approach is used for splitting nodes and fitting a tree to observed data (p. 255 ff.); basically, the idea is to minimize, by pruning the tree, $D+\alpha \#(T)$, where $\#(T)$ is the number of nodes in the tree $T$ — we recognize the usual cost-complexity trade-off. In this respect, $D$ is equivalent to other measures of node impurity (i.e., the heterogeneity of the class distribution at a given node), which are based on entropy or information gain, or on the well-known Gini index, defined as $1-\sum_kp_{ik}^2$ (the unknown proportions are estimated from the node proportions).
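With the class proportions estimated from the node counts, the node deviance is just $2n_i$ times the node entropy, so minimizing one amounts to minimizing the other. A small numerical sketch, with hypothetical counts $n_{ik}$ at a single node:

```r
## Hypothetical class counts n_ik at one node (not from the text above)
n_ik <- c(30, 15, 5)
p_ik <- n_ik / sum(n_ik)                # proportions estimated from the node
D_i     <- -2 * sum(n_ik * log(p_ik))   # node deviance
entropy <- -sum(p_ik * log(p_ik))       # entropy (in nats)
gini    <- 1 - sum(p_ik^2)              # Gini impurity index
## With p_ik = n_ik / n_i, the deviance reduces to 2 * n_i * entropy:
all.equal(D_i, 2 * sum(n_ik) * entropy)   # TRUE
```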
With a regression tree, the idea is quite similar, and we can conceptualize the deviance as a sum of squares, defined for the individuals $j$ in leaf $i$ by
$$D_i=\sum_j(y_j-\mu_i)^2,$$
summed over all leaves. Here, the probability model considered within each leaf is a Gaussian $\mathcal{N}(\mu_i,\sigma^2)$. Quoting Venables and Ripley (p. 256): "$D$ is the usual scaled deviance for a Gaussian GLM. However, the distribution at internal nodes of the tree is then a mixture of normal distributions, and so $D_i$ is only appropriate at the leaves. The tree-construction process has to be seen as a hierarchical refinement of probability models, very similar to forward variable selection in regression." Section 9.2 provides further detailed information about the rpart implementation, but you can already look at the residuals() function for rpart objects, where "deviance residuals" are computed as the square root of minus twice the logarithm of the fitted model.
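You can inspect these node deviances directly: rpart stores them in the `dev` column of the fitted object's `frame`, and for `method = "anova"` this is the within-node sum of squares, so at the root it equals the TSS of the response. A small simulated check (the data are made up for illustration):

```r
library(rpart)                       # recommended package shipped with R
set.seed(101)
x <- runif(120)
y <- sin(2 * pi * x) + rnorm(120, sd = 0.3)
fit <- rpart(y ~ x, method = "anova")
## First row of fit$frame is the root node; its deviance is the TSS of y
all.equal(fit$frame$dev[1], sum((y - mean(y))^2))   # TRUE
```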
An Introduction to Recursive Partitioning Using the rpart Routines, by Atkinson and Therneau, is also a good starting point. For a more general review (including bagging), I would recommend
Best Answer
Variable importance might generally be computed based on the reduction in predictive accuracy when the predictor of interest is removed (via a permutation technique, as in Random Forests) or on some measure of the decrease in node impurity, but see (1) for an overview of the available methods. An obvious alternative to CART is RF, of course (randomForest, but see also party). With RF, the Gini importance index is defined as the averaged Gini decrease in node impurity over all trees in the forest (it follows from the fact that the Gini impurity of a given parent node is larger than the weighted average of that measure over its two daughter nodes, see e.g. (2)).
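The parent-vs-daughters inequality behind the Gini importance can be checked by hand on a single split. The counts below are hypothetical, and the base-R sketch only computes one split's impurity decrease (the RF index averages such decreases over all splits and trees):

```r
## Gini impurity of a node, from its class counts
gini <- function(counts) { p <- counts / sum(counts); 1 - sum(p^2) }

parent <- c(40, 40)    # hypothetical class counts at the parent node
left   <- c(35, 10)    # daughter nodes after some candidate split
right  <- c(5, 30)     # (left + right must equal parent)

n <- sum(parent); nl <- sum(left); nr <- sum(right)
decrease <- gini(parent) - (nl / n) * gini(left) - (nr / n) * gini(right)
decrease   # strictly positive: each split reduces weighted impurity
```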
I know that Carolin Strobl and colleagues have contributed a lot of simulation and experimental studies on (conditional) variable importance in RFs and CART (e.g., (3)–(4), but there are many others); see also her thesis, Statistical Issues in Machine Learning – Towards Reliable Split Selection and Variable Importance Measures.
To my knowledge, the caret package (5) only considers a loss function for the regression case (i.e., mean squared error). Maybe this will be added in the near future (in any case, an example with a classification task using k-NN is available in the on-line help for dotPlot). However, Noel M. O'Boyle seems to have some R code for variable importance in CART.
References