Solved – Decision Tree – Splitting Factor Variables

cart, categorical data, categorical-encoding, r, rpart

I'm new to decision trees and I have some confusion about how factor variables and non-ordered character/string variables get handled in a split.

Suppose I have a factor such as "tiny, small, medium, large, huge" where the order of the levels matters. How does a decision tree try to find the best split? Will it only check the 4 obvious splits, or will it also check splits on weird combinations like "tiny or huge, but not small, medium, or large"?

Similarly, how does a decision tree check for a split for an unordered character variable such as "New Orleans, Birmingham, Jackson, Miami, Atlanta"?

I'm using the rpart package in R as I try to learn this stuff, so any references to rpart's implementation would be helpful.

Best Answer

rpart treats ordinal and nominal qualitative variables (factors, in R parlance) differently. For your first variable, provided it has been defined as an ordered factor, the only splits considered would be:

  • {tiny} {small, medium, large, huge}
  • {tiny, small} {medium, large, huge}
  • {tiny, small, medium} {large, huge}
  • {tiny, small, medium, large} {huge}

while for a purely nominal variable, all $2^{k-1} - 1$ possible splits ($k$ = number of levels) would be tested. For your five cities, that is $2^{4} - 1 = 15$ candidate splits. Of course, this becomes infeasible when $k$ is very large, so you might have to compromise by aggregating levels.
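To make the two counts concrete, here is a minimal sketch in Python that simply enumerates the candidate partitions for the two variables from the question. It only illustrates the combinatorics; it is not how rpart searches internally, and the function names `ordered_splits` and `nominal_splits` are made up for this example.

```python
from itertools import combinations


def ordered_splits(levels):
    """Binary splits that respect the level order: one cut between adjacent levels."""
    return [(levels[:i], levels[i:]) for i in range(1, len(levels))]


def nominal_splits(levels):
    """All binary partitions of an unordered set of levels, each listed once."""
    first, rest = levels[0], levels[1:]
    splits = []
    # Pin the first level to the left side so that a split and its
    # mirror image are not both listed. Stopping before len(rest)
    # keeps the right side from ever being empty.
    for r in range(len(rest)):
        for extra in combinations(rest, r):
            left = [first, *extra]
            right = [lvl for lvl in rest if lvl not in extra]
            splits.append((left, right))
    return splits


sizes = ["tiny", "small", "medium", "large", "huge"]
cities = ["New Orleans", "Birmingham", "Jackson", "Miami", "Atlanta"]

print(len(ordered_splits(sizes)))   # 4
print(len(nominal_splits(cities)))  # 15

# The nominal count grows quickly with the number of levels k.
for k in (10, 20, 30):
    print(k, 2 ** (k - 1) - 1)
```

With 30 levels the exhaustive count is already over half a billion, which is why aggregating levels can become necessary.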
