Solved – Decision trees in smaller datasets

cart, r

I load the following dataset:

 train <- read.csv(url("http://s3.amazonaws.com/assets.datacamp.com/course/Kaggle/train.csv"))

When I want to make a decision tree using this data I do:

 library(rpart)
 my_tree_two <- rpart(Survived ~ Sex + Age, data=train, method="class")

This works fine. However I have created a (smaller) subset:

library(dplyr)
t <- select(train, Survived, Sex, Age)
t <- t[c(1:100), ]
t <- filter(t, !is.na(Age))

But now when I want to create a decision tree using

my_tree_two <- rpart(Survived ~ Sex + Age, data=t, method="class")

I only see this:
n= 78

 node), split, n, loss, yval, (yprob)
  * denotes terminal node

  1) root 78 31 0 (0.6025641 0.3974359)  
  2) Sex=male 45  6 0 (0.8666667 0.1333333) *
  3) Sex=female 33  8 1 (0.2424242 0.7575758) *

Could anybody tell me why, with the smaller sample, the tree only splits on Sex instead of on both Sex and Age?

Best Answer

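There are default settings that control the splits: minsplit (the minimum number of observations a node must have before a split is attempted), minbucket (the minimum number of observations in any terminal node) and cp (the minimum improvement in fit a split must achieve). On a small sample these thresholds are reached much sooner. You can see them in the documentation for rpart.control.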
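As a quick check, the defaults can also be read directly in R, since calling rpart.control() with no arguments returns the settings as a list. This is only a small sketch; the variable name defaults is just for illustration.

```r
library(rpart)

# rpart.control() with no arguments returns the default settings as a list
defaults <- rpart.control()

defaults$minsplit   # minimum observations in a node before a split is attempted
defaults$minbucket  # minimum observations allowed in any terminal node
defaults$cp         # minimum improvement in fit required to keep a split
```

With the defaults, minbucket is derived from minsplit as round(minsplit/3).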

If you decrease the minbucket size using rpart.control like this:

my_tree_two <- rpart(Survived ~ Sex + Age, data=t, method="class", control=rpart.control(minbucket=2))

Then you'll end up with more splits, including Age.
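To confirm which variables a tree actually splits on, one option is to look at the var column of the fitted object's frame. The sketch below is an illustration under assumptions, not the asker's exact setup: it uses the kyphosis data that ships with rpart so that it runs without a download, and vars_used is a hypothetical helper name. Note also that when only minbucket is supplied, rpart.control sets minsplit to three times minbucket, so both thresholds are loosened together.

```r
library(rpart)

# Hypothetical helper: list the variables a fitted rpart tree splits on.
# Terminal nodes are recorded as "<leaf>" in fit$frame$var, so drop those.
vars_used <- function(fit) {
  v <- as.character(fit$frame$var)
  unique(v[v != "<leaf>"])
}

# kyphosis ships with rpart; it stands in for the question's data here.
fit_default <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis,
                     method = "class")
fit_loose <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis,
                   method = "class",
                   control = rpart.control(minbucket = 2))

vars_used(fit_default)  # variables used under the default settings
vars_used(fit_loose)    # variables used with the smaller minbucket
```

The same helper can be applied to my_tree_two from the question, fitted with and without the control argument, to see whether Age now appears.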
