Decision Trees – Conditional Inference Trees vs Traditional Decision Trees: A Comparative Study

Tags: cart, machine-learning, r

Can anyone explain the primary differences between conditional inference trees (ctree from the party package in R) and the more traditional decision tree algorithms (such as rpart in R)?

  1. What makes CI trees different?
  2. Strengths and weaknesses?

Update: I have looked at the paper by Hothorn et al. that chl refers to in the comments. I was not able to follow it completely; can anyone explain how variables are selected using permutations (e.g. what is an influence function)?

Best Answer

For what it's worth:

Both rpart and ctree recursively perform univariate splits of the dependent variable based on values of a set of covariates. rpart and related algorithms usually employ impurity or information measures (such as the Gini index) for selecting the current covariate.

ctree, according to its authors (see chl's comments), avoids the following variable selection bias of rpart (and related methods): they tend to select variables that have many possible splits or many missing values. Unlike the others, ctree uses a significance-test procedure to select variables instead of selecting the variable that maximizes an information measure (e.g. the Gini index).

The significance test, or better: the multiple significance tests computed at each step of the algorithm (select covariate - choose split - recurse) are permutation tests, that is, "the distribution of the test statistic under the null hypothesis is obtained by calculating all possible values of the test statistic under rearrangements of the labels on the observed data points" (from the Wikipedia article).

Now for the test statistic: it is computed from transformations (including the identity, i.e. no transform) of the dependent variable and the covariates. You can choose any of a number of transformations for both variables. For the DV (dependent variable), this transformation is the influence function you were asking about.

Examples (taken from the paper):

  • if both DV and covariates are numeric, you might select identity transforms and calculate correlations between the covariate and all possible permutations of the values of the DV. Then, you calculate the p-value from this permutation test and compare it with p-values for other covariates.
  • if both the DV and the covariate are nominal (unordered categorical), the test statistic is computed from a contingency table (a small sketch of this case follows the numeric example below).
  • you can easily derive other kinds of test statistics from this general scheme by choosing other transformations (including the identity transform).

A small example of a permutation test in R:

require(gtools)
dv <- c(1, 3, 4, 5, 5); covariate <- c(2, 2, 5, 4, 5)
# all possible permutations of dv (5! = 120 rows):
perms <- permutations(5, 5, dv, set = FALSE)
# now calculate correlations of all permutations with the covariate:
cors <- apply(perms, 1, function(perms_row) cor(perms_row, covariate))
cors <- sort(cors)
# now the p-value: compare cor(dv, covariate) with the
# sorted vector of all permutation correlations
length(cors[cors >= cor(dv, covariate)]) / length(cors)
# result: [1] 0.1, i.e. a p-value of .1
# note that this is a one-sided test
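
For the second bullet above (nominal DV and nominal covariate), the same logic applies with a contingency-table statistic. A minimal sketch, assuming the Pearson chi-squared statistic and a Monte Carlo approximation instead of full enumeration (ctree's actual statistic is a quadratic form from the Hothorn et al. framework, so this is only an illustration):

set.seed(1)
dv        <- factor(c("a", "a", "b", "b", "a", "b", "a", "b"))
covariate <- factor(c("x", "x", "x", "y", "y", "y", "x", "y"))
# observed chi-squared statistic from the contingency table
obs <- suppressWarnings(chisq.test(table(dv, covariate))$statistic)
# null distribution: recompute the statistic after shuffling the DV labels
null <- replicate(2000,
  suppressWarnings(chisq.test(table(sample(dv), covariate))$statistic))
# one-sided permutation p-value
mean(null >= obs)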

Now suppose you have a set of covariates, not only one as above. Then calculate p-values for each of the covariates as in the above scheme, and select the one with the smallest p-value. You want to compare p-values instead of the test statistics directly, because your covariates may be of different kinds (e.g. numeric and categorical), which makes the raw statistics incomparable.
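
A rough sketch of that selection step (the absolute correlation as test statistic and random instead of exhaustive permutations are my own simplifications here, not ctree's exact procedure, which also adjusts for multiple testing):

set.seed(42)
n  <- 40
dv <- rnorm(n)
covariates <- data.frame(x1 = dv + rnorm(n),  # related to the DV
                         x2 = rnorm(n),       # unrelated
                         x3 = rnorm(n))       # unrelated
# Monte Carlo permutation p-value for a single numeric covariate
perm_p <- function(x, y, B = 1999) {
  obs  <- abs(cor(x, y))
  null <- replicate(B, abs(cor(x, sample(y))))
  mean(null >= obs)
}
p_values <- sapply(covariates, perm_p, y = dv)
# select the covariate with the smallest p-value
names(which.min(p_values))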

Once you have selected a covariate, explore all possible splits (or, more often, a restricted set of them, e.g. by requiring a minimum number of observations in each resulting node), again evaluating a permutation-based test.
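
And a similarly simplified sketch of that split search, assuming a numeric covariate has already been selected (I use a difference-in-means statistic and random permutations; ctree's real criterion is a standardized statistic from the same permutation framework):

set.seed(7)
n  <- 50
x  <- runif(n)                          # the selected covariate
dv <- ifelse(x > 0.6, 2, 0) + rnorm(n)  # true cutpoint near 0.6
# permutation p-value for the difference in DV means induced by a cutpoint
split_p <- function(y, left, B = 999) {
  obs  <- abs(mean(y[left]) - mean(y[!left]))
  null <- replicate(B, {
    yp <- sample(y)
    abs(mean(yp[left]) - mean(yp[!left]))
  })
  mean(null >= obs)
}
# candidate cutpoints, requiring at least 5 observations on each side
cuts <- sort(x)[5:(n - 5)]
p_per_cut <- sapply(cuts, function(cp) split_p(dv, x <= cp))
# choose the cutpoint with the smallest p-value
cuts[which.min(p_per_cut)]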

ctree comes with a number of possible transformations for both DV and covariates (see the help for Transformations in the party package).

So, generally, the main difference seems to be that ctree uses a covariate selection scheme that is based on statistical theory (i.e. selection by permutation-based significance tests) and thereby avoids a potential selection bias in rpart; otherwise the two approaches seem similar. For example, conditional inference trees can be used as base learners for random forests (see the sketch below).
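
For a quick side-by-side look, a minimal fitting example (the call signatures below are how I remember the rpart and party interfaces; check the package documentation in case they differ):

library(rpart)
library(party)

# traditional CART-style tree
fit_rpart <- rpart(Species ~ ., data = iris)

# conditional inference tree: covariates selected via permutation tests
fit_ctree <- ctree(Species ~ ., data = iris)

print(fit_ctree)
plot(fit_ctree)

# conditional inference trees as base learners for a random forest
fit_cforest <- cforest(Species ~ ., data = iris,
                       controls = cforest_unbiased(ntree = 100))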

This is about as far as I can get. For more information, you really need to read the papers. Note that I strongly recommend that you know what you are doing before applying any kind of statistical analysis.
