Deviance – What is Deviance in CART/rpart Analysis? Understanding the Concept

Tags: cart, deviance, r, rpart

What is "Deviance," how is it calculated, and what are its uses in different fields in statistics?

In particular, I'm personally interested in its uses in CART (and its implementation in rpart in R).

I'm asking this since the Wikipedia article seems somewhat lacking, and your insights will be most welcome.

Best Answer

Deviance and GLM

Formally, one can view deviance as a sort of distance between two probabilistic models; in the GLM context, it amounts to two times the log likelihood ratio between two nested models, $2\log(\ell_1/\ell_0)$, where $\ell_0$ is the "smaller" model, i.e., a linear restriction on the model parameters (cf. the Neyman–Pearson lemma), as @suncoolsu said. As such, it can be used to perform model comparison. It can also be seen as a generalization of the RSS used in OLS estimation (ANOVA, regression), since it provides a measure of goodness of fit of the model being evaluated when compared to the null model (intercept only). It works with an LM too:

> x <- rnorm(100)
> y <- 0.8*x+rnorm(100)
> lm.res <- lm(y ~ x)

The residual sum of squares (RSS) is computed as $\hat\varepsilon^\top\hat\varepsilon$, which is readily obtained as:

> t(residuals(lm.res))%*%residuals(lm.res)
         [,1]
[1,] 98.66754

or from the (unadjusted) $R^2$

> summary(lm.res)

Call:
lm(formula = y ~ x)

(...)

Residual standard error: 1.003 on 98 degrees of freedom
Multiple R-squared: 0.4234, Adjusted R-squared: 0.4175 
F-statistic: 71.97 on 1 and 98 DF,  p-value: 2.334e-13 

since $R^2=1-\text{RSS}/\text{TSS}$, where $\text{TSS}$ is the total sum of squares. Note that the RSS is also directly available in an ANOVA table, like

> summary.aov(lm.res)
            Df Sum Sq Mean Sq F value    Pr(>F)    
x            1 72.459  72.459  71.969 2.334e-13 ***
Residuals   98 98.668   1.007                      
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 
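
To make the $R^2$ route explicit, here is a quick check of my own (reusing the lm.res fit and the y vector from above):

> r2 <- summary(lm.res)$r.squared
> tss <- sum((y - mean(y))^2)   # total sum of squares
> (1 - r2) * tss                # should reproduce the RSS shown above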

Now, look at the deviance:

> deviance(lm.res)
[1] 98.66754

In fact, for linear models the deviance equals the RSS (you may recall that OLS and ML estimates coincide in such a case).
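
The same mechanics apply to GLMs, where the drop in deviance between two nested fits is twice their log likelihood ratio. A minimal sketch of mine (not part of the original answer), using a simulated logistic regression:

> set.seed(101)
> x2 <- rnorm(200)
> y2 <- rbinom(200, 1, plogis(0.8 * x2))
> glm0 <- glm(y2 ~ 1, family = binomial)   # "smaller" (intercept-only) model
> glm1 <- glm(y2 ~ x2, family = binomial)  # "larger" model
> deviance(glm0) - deviance(glm1)                # drop in deviance
> as.numeric(2 * (logLik(glm1) - logLik(glm0)))  # = 2 x log likelihood ratio
> anova(glm0, glm1, test = "Chisq")              # the same comparison as a LR test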

Deviance and CART

We can see CART as a way to allocate $n$ already-labeled individuals to arbitrary classes (in a classification context). A tree can be viewed as providing a probability model for individuals' class membership: at each node $i$, we have a probability distribution $p_{ik}$ over the classes. What is important here is that the leaves of the tree give us a random sample $n_{ik}$ from a multinomial distribution specified by $p_{ik}$. We can thus define the deviance of a tree, $D$, as the sum over all leaves of

$$D_i=-2\sum_kn_{ik}\log(p_{ik}),$$

following Venables and Ripley's notation (MASS, Springer, 2002, 4th ed.). If you have access to this essential reference for R users (IMHO), you can check for yourself how such an approach is used for splitting nodes and fitting a tree to observed data (p. 255 ff.); basically, the idea is to minimize, by pruning the tree, $D+\alpha \#(T)$, where $\#(T)$ is the number of nodes in the tree $T$. Here we recognize the cost-complexity trade-off. $D$ plays the same role as node impurity (i.e., the heterogeneity of the distribution at a given node), for which common measures are based on entropy or information gain, or on the well-known Gini index, defined as $1-\sum_kp_{ik}^2$ (the unknown proportions $p_{ik}$ are estimated from the node proportions).
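
To make the formula above concrete, here is a small sketch of mine (not part of the original answer) that computes the per-leaf multinomial deviance by hand for a classification tree fitted to the built-in iris data; the last line prints rpart's cost-complexity table, in which cp is a rescaled version of the $\alpha$ above:

> library(rpart)
> fit <- rpart(Species ~ ., data = iris, method = "class", parms = list(split = "information"))
> counts <- table(fit$where, iris$Species)  # class counts n_ik in each leaf
> probs <- prop.table(counts, 1)            # estimated class probabilities p_ik
> D_i <- -2 * rowSums(ifelse(counts > 0, counts * log(probs), 0))  # 0*log(0) taken as 0
> D_i           # per-leaf deviance
> sum(D_i)      # total deviance D of the tree
> printcp(fit)  # cost-complexity pruning table (cp plays the role of alpha, rescaled)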

With a regression tree, the idea is quite similar, and we can conceptualize the deviance as a sum of squares, defined for the individuals $j$ falling in node $i$ by

$$D_i=\sum_j(y_j-\mu_i)^2,$$

summed over all leaves. Here, the probability model considered within each leaf is a Gaussian $\mathcal{N}(\mu_i,\sigma^2)$. Quoting Venables and Ripley (p. 256), "$D$ is the usual scaled deviance for a gaussian GLM. However, the distribution at internal nodes of the tree is then a mixture of normal distributions, and so $D_i$ is only appropriate at the leaves. The tree-construction process has to be seen as a hierarchical refinement of probability models, very similar to forward variable selection in regression." Section 9.2 provides further detailed information about the rpart implementation, but you can already look at the residuals() function for rpart objects, where "deviance residuals" are computed as the square root of minus twice the logarithm of the fitted model.
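
As a quick illustration of mine (not part of the original answer), one can check that the per-leaf deviance of a regression tree is just the within-leaf sum of squares. If I read the rpart documentation correctly, the dev column of the fitted object's frame holds exactly this quantity for method = "anova", so the two columns below should agree (using the built-in cars data, with rpart loaded as above):

> fit2 <- rpart(dist ~ speed, data = cars, method = "anova")
> leaves <- sort(unique(fit2$where))  # rows of fit2$frame corresponding to the leaves
> ss <- sapply(leaves, function(i) sum((cars$dist[fit2$where == i] - mean(cars$dist[fit2$where == i]))^2))
> cbind(by_hand = ss, rpart_dev = fit2$frame$dev[leaves])  # within-leaf SS vs rpart's deviance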

An Introduction to Recursive Partitioning Using the RPART Routines, by Atkinson and Therneau, is also a good starting point. For a more general review (including bagging), I would recommend