Decision Trees – Differences in Implementation of Binary Splits

Tags: cart, partitioning, rpart

I am curious about the practical implementation of a binary split in a decision tree – as it relates to the levels of a categorical predictor $X_{j}$.

Specifically, I often use some sort of sampling scheme (e.g. bagging, oversampling, etc.) when building a predictive model with a decision tree, in order to improve its predictive accuracy and stability. During these sampling routines, it is possible for a categorical variable to be presented to the tree-fitting algorithm with fewer than the complete set of levels.

Say a variable X takes on levels {A,B,C,D,E}. In a sample, maybe only levels {A,B,C,D} are present. Then, when the resulting tree is used for prediction, the full set may be present.

Continuing with this example, say a tree splits on X and sends {A,B} to the left and {C,D} to the right. I would expect the logic of the binary split, when faced with new data, to be: "if X has value A or B, send this case to the left; otherwise, send it to the right". What seems to happen in some implementations is: "if X has value A or B, send it to the left; if X has value C or D, send it to the right". When a case takes on value E, that logic breaks down.
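
For concreteness, here is a tiny illustration of the two behaviours (my own sketch, not taken from any package), using the {A,B} vs {C,D} split above:

split_robust  <- function(x) ifelse(x %in% c("A","B"), "left", "right")
split_brittle <- function(x){
  out <- rep(NA_character_, length(x))
  out[x %in% c("A","B")] <- "left"   # explicit rule for A and B
  out[x %in% c("C","D")] <- "right"  # explicit rule for C and D
  out                                # anything else stays NA
}
split_robust("E")   # "right" -- the complement rule absorbs the new level
split_brittle("E")  # NA -- this is the "breaks down" case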

What is the "right" way for a binary split to handle this? The more robust approach seems to be implemented often, but not always (see rpart below).

Here are a couple examples:

Rpart fails, the others are ok.

#test trees and unseen factor levels
library(rpart)   # rpart() and the solder data set
library(tree)    # tree()
library(party)   # ctree() and ctree_control()

data(solder, package = "rpart")
summary(solder)
table(solder$PadType)

# create train and validation
set.seed(12345)
t_rows<-sample(1:nrow(solder),size=360, replace=FALSE)
train_solder<-solder[t_rows,]
val_solder<-solder[-t_rows,]

#look at PadType
table(train_solder$PadType)
table(val_solder$PadType)
#collapse several levels to 'MISSING' in the training data only, so the
#validation data will later present levels the fitted trees never saw
levels(train_solder$PadType)[levels(train_solder$PadType) %in% c('L8','L9','W4','W9')] <- 'MISSING'


#Fit several trees, may have to play with the parameters to get them to split on the variable

####RPART
mod_rpart<-rpart(Solder~PadType,data=train_solder)
predict(mod_rpart,val_solder)
#Error in model.frame.default(Terms, newdata, na.action = na.action, xlev = attr(object,  : 
#factor 'PadType' has new level(s) D6, L6, L7, L8, L9, W4

####TREE
mod_tree<-tree(Solder~PadType,data=train_solder,split="gini")
predict(mod_tree,val_solder) #works fine

####ctree
mod_ctree<-ctree(Solder~PadType,data=train_solder,control = ctree_control(mincriterion = 0.05))
predict(mod_ctree,val_solder) #works fine
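
One workaround for the rpart failure (my own sketch, assuming rpart's usual NA handling at prediction time: na.pass plus surrogate/majority-rule routing) is to re-declare the validation factor with the training levels, so that unseen levels become NA before predict is called:

val_fixed <- val_solder
#levels absent from the training factor become NA under this re-declaration
val_fixed$PadType <- factor(val_fixed$PadType, levels = levels(train_solder$PadType))
predict(mod_rpart, val_fixed) #NA rows are routed by surrogates or the majority rule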

Best Answer

In fact there are two types of factors -- ordered (like Tiny < Small < Medium < Big < Huge) and unordered (Cucumber, Carrot, Fennel, Aubergine).
The first class is handled the same way as continuous predictors -- it is only easier to enumerate the candidate split points, and there is no problem with extending the list of levels.
For the second class, you have to choose the subset of levels that will be sent down one branch, leaving the rest to the other -- and when a previously unseen level then shows up at prediction time, you can either (a minimal sketch follows the list):

  1. throw an error,
  2. assume that the unseen class goes into your favourite (default) branch, or
  3. treat it as NA and select a branch in a more or less random way.
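
Here is a minimal sketch of option 2 (with NA values handled the same way, which shades into option 3): values in a chosen left set go left, other known levels go right, and unseen or missing values fall back to a default branch. The function and argument names are mine, not from any particular implementation.

route <- function(x, left_set, known_levels, default = "right") {
  ifelse(is.na(x) | !(x %in% known_levels), default,   # unseen or NA: fallback branch
         ifelse(x %in% left_set, "left", "right"))     # ordinary split logic
}
route(c("A","C","E",NA), left_set = c("A","B"), known_levels = c("A","B","C","D"))
# [1] "left"  "right" "right" "right"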

Now, direct treatment of unordered factors is tricky, so many algorithms "cheat" and treat unordered factors as if they were ordered, and thus never even touch this problem*. The rest usually use an integer mask, i.e. they optimize an integer from $1$ to $2^{\text{#categories}-1}-1$ and treat the $i$-th bit as the branch selection for factor level $i$. (Ever wondered why there is so often a limit of 32 levels?) In this setting, it is quite natural that unseen levels silently go into the "0" branch. Yet this does not seem quite "right": why, really, should we do this? And what about the bias that the number of levels introduces into entropy-based attribute selection?
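
To make the integer-mask idea concrete, here is a small sketch (mine, not any library's internals) that enumerates the candidate splits of a 5-level factor; bit i of the mask says whether level i goes to the left branch:

k <- 5
lvls <- LETTERS[1:k]                 # levels A..E
for (m in 1:(2^(k - 1) - 1)) {       # all non-trivial left/right partitions
  goes_left <- bitwAnd(m, 2^(0:(k - 1))) > 0
  cat(sprintf("mask %2d: left = {%s}\n", m, paste(lvls[goes_left], collapse = ",")))
}
#a level that never had a bit assigned (an unseen one) has bit 0, so it
#silently falls into the "0" (right) branch, exactly as described above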

I would say the most sensible idea is to make the user define the full set of factor levels (for instance, R does this organically, preserving levels through subset operations) and then use option 1 for undeclared levels and option 2 for declared ones. Option 3 may make sense if you already have some NA-handling infrastructure.
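
As a small illustration of what "preserving levels through subset operations" means in R (my example, not the answer's):

x <- factor(c("A","B","C","D","E"))
sub <- x[x %in% c("A","B")]   # subsetting keeps the full level declaration
levels(sub)                   # "A" "B" "C" "D" "E"
levels(droplevels(sub))       # "A" "B" -- dropping levels is what causes trouble later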

*) There is also a side strategy of doing some non-trivial re-coding of levels into numbers, for instance Breiman encoding -- yet this generates even more problems.