Solved – Selection bias in trees


In Applied Predictive Modeling, Kuhn and Johnson write:

Finally, these trees suffer from selection bias: predictors with a
higher number of distinct values are favored over more granular
predictors (Loh and Shih, 1997; Carolin et al., 2007; Loh, 2010).
Loh and Shih (1997) remarked that "The danger occurs when a
data set consists of a mix of informative and noise variables, and the
noise variables have many more splits than the informative variables.
Then there is a high probability that the noise variables will be
chosen to split the top nodes of the tree. Pruning will produce either
a tree with misleading structure or no tree at all."

Kuhn, Max; Johnson, Kjell (2013-05-17). Applied Predictive Modeling
(Kindle Locations 5241-5247). Springer New York. Kindle Edition.

They go on to describe some research into building unbiased trees, for example Loh's GUIDE model.

Staying as strictly as possible within the CART framework, I'm wondering whether there is anything I can do to minimize this selection bias. For example, perhaps clustering/grouping high-cardinality predictors is one strategy. But to what degree should one do the grouping? If I have a predictor with 30 levels, should I group it down to 10 levels? 15? 5?
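
For concreteness, here is a rough sketch of what such grouping could look like in R before fitting a CART tree with rpart. The data, the variable names, and the choice of keeping the 10 most frequent levels are all made up for illustration, not a recommendation.

```r
# Sketch only: collapse a 30-level factor to its 10 most frequent levels,
# lump everything else into "other", then fit a CART tree with rpart.
library(rpart)

set.seed(1)
df <- data.frame(
  x_highcard = factor(sample(paste0("lvl", 1:30), 500, replace = TRUE)),
  x_numeric  = rnorm(500)
)
df$y <- factor(ifelse(df$x_numeric + rnorm(500) > 0, "yes", "no"))

# Keep the 10 most frequent levels; pool the rest into "other".
keep <- names(sort(table(df$x_highcard), decreasing = TRUE))[1:10]
df$x_grouped <- factor(ifelse(df$x_highcard %in% keep,
                              as.character(df$x_highcard), "other"))

fit <- rpart(y ~ x_grouped + x_numeric, data = df, method = "class")
```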

Best Answer

Based on your comment, I'd go with a conditional inference framework. The code is readily available in R via the ctree function in the party package. It performs unbiased variable selection, and although the algorithm that decides when and how to split differs from CART's, the overall logic is essentially the same. Another benefit outlined by the authors (see the paper by Hothorn, Hornik, and Zeileis, 2006) is that you don't have to worry as much about pruning the tree to avoid overfitting: the algorithm takes care of that by using permutation tests to determine whether a split is "statistically significant" or not.
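
A minimal sketch of what this looks like in practice, using the built-in iris data purely for illustration (the mincriterion value shown is the package default, not a tuned choice):

```r
# Conditional inference tree with the party package; a split is made only
# when a permutation test rejects independence at the chosen level.
library(party)

ct <- ctree(Species ~ ., data = iris,
            controls = ctree_control(mincriterion = 0.95))

plot(ct)                          # node labels show the test p-values
table(predict(ct), iris$Species)  # quick look at the in-sample fit
```

Raising mincriterion (e.g. to 0.99) makes the permutation test stricter, so the tree simply stops growing earlier rather than relying on post-hoc pruning.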
