Categorical Data – Best Practices for Coding Features for Decision Trees

Tags: boosting, CART, categorical data, random forest

When coding categorical features for linear regression, there is a rule: the number of dummies should be one less than the number of levels (to avoid collinearity).

Does a similar rule exist for decision trees (bagged, boosted)? I am asking because the standard practice in Python seems to be to expand n levels into n dummies (sklearn's OneHotEncoder or Pandas' pd.get_dummies), which appears suboptimal to me.
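For concreteness, a minimal sketch of the n-dummies vs. n-1-dummies distinction, using a made-up colour column (the data and column name are purely illustrative):

```python
import pandas as pd

# Hypothetical toy frame: one categorical column with 3 levels.
df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# Default pandas behaviour: one dummy per level (3 columns for 3 levels).
full = pd.get_dummies(df["color"])

# The linear-regression convention: drop one level to avoid collinearity.
reduced = pd.get_dummies(df["color"], drop_first=True)

print(full.columns.tolist())     # ['blue', 'green', 'red']
print(reduced.columns.tolist())  # ['green', 'red']
```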

What would you suggest as best practices for coding Categorical features for Decision Trees?

Best Answer

It seems like you understand that you're able to have n levels, as opposed to n-1, because unlike in linear regression you don't need to worry about perfect collinearity.

(I'm coming at this from an R perspective, but I assume it's the same in Python.) That depends on a couple of things, such as 1) which package you're using and 2) how many factor levels you have.

1) If you are using R's randomForest package and you have fewer than 33 factor levels, you can go ahead and leave them in one feature if you want. That's because R's random forest implementation will check which factor levels should be on one side of the split and which on the other (e.g., 5 of your levels might be grouped together on the left side, and 7 might be grouped together on the right). If you split the categorical feature out into n dummies, the algorithm would not have this option at its disposal.
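On the Python side of the question (the answer above is R-centric), a rough analogue is scikit-learn's HistGradientBoostingClassifier, which in recent versions can treat integer-encoded columns as categorical and group levels on either side of a split rather than requiring dummies. A minimal sketch with made-up data:

```python
import numpy as np
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.preprocessing import OrdinalEncoder

# Made-up data: one categorical column (strings) and one numeric column.
rng = np.random.default_rng(0)
colors = rng.choice(["red", "green", "blue", "yellow"], size=200)
numeric = rng.normal(size=200)
y = (colors == "red").astype(int)  # toy target

# Encode the categorical column as integers 0..n_levels-1, then tell the
# estimator which column should be treated as categorical.
cat = OrdinalEncoder().fit_transform(colors.reshape(-1, 1)).ravel()
X = np.column_stack([cat, numeric])

clf = HistGradientBoostingClassifier(categorical_features=[0])
clf.fit(X, y)
```

The point of keeping the column as a single categorical feature, as in the R case, is that the tree can partition the levels into two groups at a split instead of being restricted to one-dummy-at-a-time splits.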

Obviously, if the particular package you're using can't handle categorical features, then you'd just need to create n dummy variables.

2) As I alluded to above, R's random forest implementation can only handle 32 factor levels - if you have more than that, you either need to split your factor into smaller subsets of levels or create a dummy variable for each level.
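One common way to get under a level cap like this (or simply to keep the dummy count manageable) is to lump the rarer levels together before encoding. A sketch with a hypothetical column; the threshold of 30 levels is arbitrary:

```python
import pandas as pd

# Hypothetical column with many levels; keep the 30 most frequent
# and lump everything else into a single "other" level.
s = pd.Series([f"level_{i % 50}" for i in range(1000)], name="cat")

top = s.value_counts().nlargest(30).index
s_reduced = s.where(s.isin(top), other="other")

print(s_reduced.nunique())  # at most 31 levels
```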
