For what it's worth: both `rpart` and `ctree` recursively perform univariate splits of the dependent variable based on the values of a set of covariates. `rpart` and related algorithms usually employ information measures (such as the Gini coefficient) for selecting the current covariate.
`ctree`, according to its authors (see chl's comments), avoids the following variable-selection bias of `rpart` (and related methods): they tend to select variables that have many possible splits or many missing values. Unlike the others, `ctree` uses a significance-test procedure to select variables, instead of selecting the variable that maximizes an information measure (e.g., the Gini coefficient).
The significance test, or better, the multiple significance tests computed at each start of the algorithm (select covariate, choose split, recurse), are permutation tests; that is, "the distribution of the test statistic under the null hypothesis is obtained by calculating all possible values of the test statistic under rearrangements of the labels on the observed data points" (from the Wikipedia article).
Now for the test statistic: it is computed from transformations (including the identity, i.e. no transformation) of the dependent variable and the covariates. You can choose any of a number of transformations for both variables. For the dependent variable (DV), the transformation is called the influence function you were asking about.
Examples (taken from the paper):
- if both DV and covariates are numeric, you might select identity transforms and calculate correlations between the covariate and all possible permutations of the values of the DV. Then, you calculate the p-value from this permutation test and compare it with p-values for other covariates.
- if both DV and the covariates are nominal (unordered categorical), the test statistic is computed from a contingency table.
- you can easily construct other kinds of test statistics from any kind of transformation (including the identity transform) within this general scheme.
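To illustrate the nominal case: base R's `chisq.test` can compute a Monte Carlo p-value from a contingency table, which is a randomized version of the permutation scheme described above (the data here are made up for illustration):

```r
# made-up nominal DV and covariate
dv  <- factor(c("a", "a", "b", "b", "a", "b", "a", "b"))
cov <- factor(c("x", "y", "x", "y", "y", "x", "x", "y"))
# chi-squared statistic on the contingency table; the p-value is obtained
# by Monte Carlo resampling of the labels rather than full enumeration
res <- chisq.test(table(dv, cov), simulate.p.value = TRUE, B = 2000)
res$p.value
```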
A small example of a permutation test in R:
require(gtools)
dv <- c(1, 3, 4, 5, 5); covariate <- c(2, 2, 5, 4, 5)
# all possible permutations of dv (5! = 120 rows):
perms <- permutations(5, 5, dv, set = FALSE)
# correlation of each permutation with the covariate:
cors <- apply(perms, 1, function(perms_row) cor(perms_row, covariate))
cors <- sort(cors)
# p-value: proportion of permutation correlations at least as
# large as the observed cor(dv, covariate)
length(cors[cors >= cor(dv, covariate)]) / length(cors)
# result: [1] 0.1, i.e. a p-value of .1
# note that this is a one-sided test
Now suppose you have a set of covariates, not only one as above. Then calculate a p-value for each covariate as in the scheme above, and select the one with the smallest p-value. You want to calculate p-values instead of comparing the correlations directly, because the covariates could be of different kinds (e.g. numeric and categorical).
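That selection step can be sketched in base R; here permutations are sampled rather than fully enumerated, and the second covariate `x2` is made up for illustration:

```r
set.seed(42)
dv <- c(1, 3, 4, 5, 5)
# two candidate covariates; x2 is invented for this sketch
covariates <- list(x1 = c(2, 2, 5, 4, 5), x2 = c(5, 1, 2, 2, 3))
# one-sided Monte Carlo permutation p-value, as in the example above,
# but with sampled permutations instead of all 120
perm_p <- function(x, y, n = 2000) {
  obs  <- cor(x, y)
  sims <- replicate(n, cor(sample(x), y))
  mean(sims >= obs)
}
pvals <- sapply(covariates, perm_p, y = dv)
# the covariate with the smallest p-value is selected for splitting
names(which.min(pvals))
```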
Once you have selected a covariate, explore all possible splits (or, often, a somehow restricted set of the possible splits, e.g. by requiring a minimum number of DV elements before splitting), again evaluating each with a permutation-based test.
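One simple way to enumerate candidate splits for a numeric covariate is the midpoints between adjacent distinct values (a sketch of the idea, not necessarily party's exact rule):

```r
covariate <- c(2, 2, 5, 4, 5)
# candidate split points: midpoints between adjacent distinct values
u <- sort(unique(covariate))          # 2 4 5
splits <- head(u, -1) + diff(u) / 2   # 3.0 4.5
# each split point s partitions the data into covariate <= s vs. > s;
# each such partition would then be evaluated with a permutation test
splits
```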
`ctree` comes with a number of possible transformations for both the DV and the covariates (see the help for `Transformations` in the `party` package).
So, generally, the main difference seems to be that `ctree` uses a covariate-selection scheme based on statistical theory (i.e., selection by permutation-based significance tests) and thereby avoids a potential bias in `rpart`; otherwise they seem similar. For example, conditional inference trees can be used as base learners for random forests.
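As a quick illustration (assuming the `party` package is installed; `iris` is built into R):

```r
# assumes the party package is installed
library(party)
# a single conditional inference tree ...
ct <- ctree(Species ~ ., data = iris)
# ... and a random forest built from conditional inference trees
cf <- cforest(Species ~ ., data = iris,
              controls = cforest_unbiased(ntree = 50))
```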
This is about as far as I can get. For more information, you really need to read the papers. Note that I strongly recommend that you really know what you're doing when you want to apply any kind of statistical analysis.
A decision tree trained on a training data set would only have no errors in classification if:
- You allowed your tree to have an infinite number of splits.
Theoretically, you could then have a large series of branches leading to terminal nodes that each contain a single observation, giving a perfect classification on the training data.
This model, however, would not generalize to new data: it would most likely perform very poorly when applied to new (test) data, because you have overfitted. Therefore, when building a classification-tree model, pruning must be performed.
To prune your model, you use the complexity parameter, which balances the tradeoff between overfitting and the misclassification rate.
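A minimal sketch with `rpart`'s built-in `kyphosis` data, pruning back to the complexity-parameter value with the smallest cross-validated error:

```r
library(rpart)
# grow a tree on rpart's built-in kyphosis data
fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis)
printcp(fit)  # cross-validated error (xerror) for each value of cp
# prune back to the cp value with the smallest cross-validated error
best_cp <- fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"]
pruned  <- prune(fit, cp = best_cp)
```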
See Using Tree-Based Models in R for a good explanation in R, and also the Choosing The Complexity Parameter instructions.
Best Answer
There are default settings that control the splits; you can see these by looking at the documentation for rpart.control.
If you decrease the minbucket size using rpart.control, then you'll end up with more splits, including Age.