For what it's worth: both `rpart` and `ctree` recursively perform univariate splits of the dependent variable based on the values of a set of covariates. `rpart` and related algorithms usually employ information measures (such as the Gini coefficient) for selecting the current covariate.
`ctree`, according to its authors (see chl's comments), avoids the following variable-selection bias of `rpart` (and related methods): they tend to select variables that have many possible splits or many missing values. Unlike the others, `ctree` uses a significance-test procedure to select variables, instead of selecting the variable that maximizes an information measure (e.g., the Gini coefficient).
The significance test, or better, the multiple significance tests computed at each start of the algorithm (select covariate, choose split, recurse), are permutation tests; that is, "the distribution of the test statistic under the null hypothesis is obtained by calculating all possible values of the test statistic under rearrangements of the labels on the observed data points" (from the Wikipedia article).
Now for the test statistic: it is computed from transformations (including the identity, i.e. no transformation) of the dependent variable and the covariates. You can choose any of a number of transformations for both variables. For the dependent variable (DV), the transformation is called the influence function you were asking about.
Examples (taken from the paper):
- if both DV and covariates are numeric, you might select identity transforms and calculate correlations between the covariate and all possible permutations of the values of the DV. Then, you calculate the p-value from this permutation test and compare it with p-values for other covariates.
- if both DV and the covariates are nominal (unordered categorical), the test statistic is computed from a contingency table.
- you can easily construct other kinds of test statistics from any kind of transformation (including the identity transform) within this general scheme.
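To illustrate the nominal case: base R's `chisq.test` can compute a Monte Carlo p-value from a contingency table, which is a randomized version of the permutation scheme described above (the data here are made up for illustration):

```r
# made-up nominal DV and covariate
dv  <- factor(c("a", "a", "b", "b", "a", "b", "a", "b"))
cov <- factor(c("x", "y", "x", "y", "y", "x", "x", "y"))
# chi-squared statistic on the contingency table; the p-value is obtained
# by Monte Carlo resampling of the labels rather than full enumeration
res <- chisq.test(table(dv, cov), simulate.p.value = TRUE, B = 2000)
res$p.value
```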
A small example of a permutation test in R:
require(gtools)
dv <- c(1, 3, 4, 5, 5); covariate <- c(2, 2, 5, 4, 5)
# all possible permutations of dv (5! = 120 rows):
perms <- permutations(5, 5, dv, set = FALSE)
# correlation of each permutation with the covariate:
cors <- apply(perms, 1, function(perms_row) cor(perms_row, covariate))
cors <- sort(cors)
# p-value: proportion of permutation correlations at least as
# large as the observed cor(dv, covariate)
length(cors[cors >= cor(dv, covariate)]) / length(cors)
# result: [1] 0.1, i.e. a p-value of .1
# note that this is a one-sided test
Now suppose you have a set of covariates, not only one as above. Then calculate a p-value for each covariate as in the scheme above, and select the one with the smallest p-value. You want to calculate p-values instead of comparing the correlations directly, because the covariates could be of different kinds (e.g. numeric and categorical).
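That selection step can be sketched in base R; here permutations are sampled rather than fully enumerated, and the second covariate `x2` is made up for illustration:

```r
set.seed(42)
dv <- c(1, 3, 4, 5, 5)
# two candidate covariates; x2 is invented for this sketch
covariates <- list(x1 = c(2, 2, 5, 4, 5), x2 = c(5, 1, 2, 2, 3))
# one-sided Monte Carlo permutation p-value, as in the example above,
# but with sampled permutations instead of all 120
perm_p <- function(x, y, n = 2000) {
  obs  <- cor(x, y)
  sims <- replicate(n, cor(sample(x), y))
  mean(sims >= obs)
}
pvals <- sapply(covariates, perm_p, y = dv)
# the covariate with the smallest p-value is selected for splitting
names(which.min(pvals))
```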
Once you have selected a covariate, explore all possible splits (or, often, a somehow restricted set of the possible splits, e.g. by requiring a minimum number of DV elements before splitting), again evaluating each with a permutation-based test.
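One simple way to enumerate candidate splits for a numeric covariate is the midpoints between adjacent distinct values (a sketch of the idea, not necessarily party's exact rule):

```r
covariate <- c(2, 2, 5, 4, 5)
# candidate split points: midpoints between adjacent distinct values
u <- sort(unique(covariate))          # 2 4 5
splits <- head(u, -1) + diff(u) / 2   # 3.0 4.5
# each split point s partitions the data into covariate <= s vs. > s;
# each such partition would then be evaluated with a permutation test
splits
```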
`ctree` comes with a number of possible transformations for both the DV and the covariates (see the help for `Transformations` in the `party` package).
So, generally, the main difference seems to be that `ctree` uses a covariate-selection scheme based on statistical theory (i.e., selection by permutation-based significance tests) and thereby avoids a potential bias in `rpart`; otherwise they seem similar. For example, conditional inference trees can be used as base learners for random forests.
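As a quick illustration (assuming the `party` package is installed; `iris` is built into R):

```r
# assumes the party package is installed
library(party)
# a single conditional inference tree ...
ct <- ctree(Species ~ ., data = iris)
# ... and a random forest built from conditional inference trees
cf <- cforest(Species ~ ., data = iris,
              controls = cforest_unbiased(ntree = 50))
```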
This is about as far as I can get. For more information, you really need to read the papers. Note that I strongly recommend that you really know what you're doing when you want to apply any kind of statistical analysis.
A decision tree trained on a training data set would only have no errors in classification if:
- You allowed your tree to have an infinite number of splits.
Theoretically, you could then have a large series of branches leading to terminal nodes that each contain a single observation, giving a perfect classification on the training data.
This model, however, would not generalize to new data: it would most likely perform very poorly when applied to new (test) data, because you have overfitted. Therefore, when building a classification-tree model, pruning must be performed.
To prune your model, you use the complexity parameter, which balances the tradeoff between overfitting and the misclassification rate.
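A minimal sketch with `rpart`'s built-in `kyphosis` data, pruning back to the complexity-parameter value with the smallest cross-validated error:

```r
library(rpart)
# grow a tree on rpart's built-in kyphosis data
fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis)
printcp(fit)  # cross-validated error (xerror) for each value of cp
# prune back to the cp value with the smallest cross-validated error
best_cp <- fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"]
pruned  <- prune(fit, cp = best_cp)
```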
See Using Tree-Based Models in R for a good explanation in R, and also the Choosing The Complexity Parameter instructions.
Best Answer
There are default settings that control the splits; you can see these by looking at the documentation for rpart.control.
If you decrease the minbucket size using rpart.control, then you'll end up with more splits, including Age.