Decision Trees – Differences in Implementation of Binary Splits

Tags: cart, partitioning, rpart

I am curious about the practical implementation of a binary split in a decision tree – as it relates to the levels of a categorical predictor $X_{j}$.

Specifically, I often use some sort of sampling scheme (e.g. bagging, oversampling, etc.) when building a predictive model with a decision tree, in order to improve its predictive accuracy and stability. During these sampling routines, it is possible for a categorical variable to be presented to the tree-fitting algorithm with fewer than the complete set of levels.

Say a variable X takes on levels {A,B,C,D,E}. In a sample, maybe only levels {A,B,C,D} are present. Then, when the resulting tree is used for prediction, the full set may be present.

Continuing with this example, say a tree splits on X and sends {A,B} to the left and {C,D} to the right. I would expect the logic of the binary split, when faced with new data, to be: "if X has value A or B, send this case to the left; otherwise, send it to the right". What seems to happen in some implementations is: "if X has value A or B, send it to the left; if X has value C or D, send it to the right". When a case takes on value E, that logic breaks down.
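
For concreteness, here is a tiny illustration of the two behaviours (my own sketch, not taken from any package), using the {A,B} vs {C,D} split above:

split_robust  <- function(x) ifelse(x %in% c("A","B"), "left", "right")
split_brittle <- function(x){
  out <- rep(NA_character_, length(x))
  out[x %in% c("A","B")] <- "left"   # explicit rule for A and B
  out[x %in% c("C","D")] <- "right"  # explicit rule for C and D
  out                                # anything else stays NA
}
split_robust("E")   # "right" -- the complement rule absorbs the new level
split_brittle("E")  # NA -- this is the "breaks down" case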

What is the "right" way for a binary split to handle this? The more robust approach seems to be implemented often, but not always (see rpart below).

Here are a couple examples:

Rpart fails, the others are ok.

#test trees and unseen factor levels
library(rpart)   # rpart() and the solder data set
library(tree)    # tree()
library(party)   # ctree() and ctree_control()

data(solder, package = "rpart")
summary(solder)
table(solder$PadType)

# create train and validation
set.seed(12345)
t_rows<-sample(1:nrow(solder),size=360, replace=FALSE)
train_solder<-solder[t_rows,]
val_solder<-solder[-t_rows,]

#look at PadType
table(train_solder$PadType)
table(val_solder$PadType)
#collapse several levels to 'MISSING' in the training data only, so the
#validation data will later present levels the fitted trees never saw
levels(train_solder$PadType)[levels(train_solder$PadType) %in% c('L8','L9','W4','W9')] <- 'MISSING'


#Fit several trees, may have to play with the parameters to get them to split on the variable

####RPART
mod_rpart<-rpart(Solder~PadType,data=train_solder)
predict(mod_rpart,val_solder)
#Error in model.frame.default(Terms, newdata, na.action = na.action, xlev = attr(object,  : 
#factor 'PadType' has new level(s) D6, L6, L7, L8, L9, W4

####TREE
mod_tree<-tree(Solder~PadType,data=train_solder,split="gini")
predict(mod_tree,val_solder) #works fine

####ctree
mod_ctree<-ctree(Solder~PadType,data=train_solder,control = ctree_control(mincriterion = 0.05))
predict(mod_ctree,val_solder) #works fine
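
One workaround for the rpart failure (my own sketch, assuming rpart's usual NA handling at prediction time: na.pass plus surrogate/majority-rule routing) is to re-declare the validation factor with the training levels, so that unseen levels become NA before predict is called:

val_fixed <- val_solder
#levels absent from the training factor become NA under this re-declaration
val_fixed$PadType <- factor(val_fixed$PadType, levels = levels(train_solder$PadType))
predict(mod_rpart, val_fixed) #NA rows are routed by surrogates or the majority rule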

Best Answer

In fact there are two types of factors -- ordered (like Tiny < Small < Medium < Big < Huge) and unordered (Cucumber, Carrot, Fennel, Aubergine).
The first class is handled the same way as continuous predictors -- it is only easier to enumerate the candidate split points, and there is no problem with extending the list of levels.
For the second class, you have to choose the subset of levels that will be sent down one branch, leaving the rest to the other -- and when a previously unseen level then shows up at prediction time, you can either (a minimal sketch follows the list):

  1. throw an error,
  2. assume that the unseen class goes into your favourite (default) branch, or
  3. treat it as NA and select a branch in a more or less random way.
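
Here is a minimal sketch of option 2 (with NA values handled the same way, which shades into option 3): values in a chosen left set go left, other known levels go right, and unseen or missing values fall back to a default branch. The function and argument names are mine, not from any particular implementation.

route <- function(x, left_set, known_levels, default = "right") {
  ifelse(is.na(x) | !(x %in% known_levels), default,   # unseen or NA: fallback branch
         ifelse(x %in% left_set, "left", "right"))     # ordinary split logic
}
route(c("A","C","E",NA), left_set = c("A","B"), known_levels = c("A","B","C","D"))
# [1] "left"  "right" "right" "right"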

Now, direct treatment of unordered factors is tricky, so many algorithms "cheat" and treat unordered factors as if they were ordered, and thus never even touch this problem*. The rest usually use an integer mask, i.e. they optimize an integer from $1$ to $2^{\text{#categories}-1}-1$ and treat the $i$-th bit as the branch selection for factor level $i$. (Ever wondered why there is so often a limit of 32 levels?) In this setting, it is quite natural that unseen levels silently go into the "0" branch. Yet this does not seem quite "right": why, really, should we do this? And what about the bias that the number of levels introduces into entropy-based attribute selection?
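
To make the integer-mask idea concrete, here is a small sketch (mine, not any library's internals) that enumerates the candidate splits of a 5-level factor; bit i of the mask says whether level i goes to the left branch:

k <- 5
lvls <- LETTERS[1:k]                 # levels A..E
for (m in 1:(2^(k - 1) - 1)) {       # all non-trivial left/right partitions
  goes_left <- bitwAnd(m, 2^(0:(k - 1))) > 0
  cat(sprintf("mask %2d: left = {%s}\n", m, paste(lvls[goes_left], collapse = ",")))
}
#a level that never had a bit assigned (an unseen one) has bit 0, so it
#silently falls into the "0" (right) branch, exactly as described above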

I would say the most sensible idea is to make the user define the full set of factor levels (for instance, R does this organically, preserving levels through subset operations) and then use option 1 for undeclared levels and option 2 for declared ones. Option 3 may make sense if you already have some NA-handling infrastructure.
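
As a small illustration of what "preserving levels through subset operations" means in R (my example, not the answer's):

x <- factor(c("A","B","C","D","E"))
sub <- x[x %in% c("A","B")]   # subsetting keeps the full level declaration
levels(sub)                   # "A" "B" "C" "D" "E"
levels(droplevels(sub))       # "A" "B" -- dropping levels is what causes trouble later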

*) There is also a side strategy of doing some non-trivial re-coding of levels into numbers, for instance Breiman encoding -- yet this generates even more problems.