Solved – Handling of categorical variables: rpart vs tree

algorithms, cart, categorical data, categorical-encoding, rpart

For the tree and randomForest packages in R, the number of levels of a factor (a categorical variable) is capped at 32. One explanation might be that the number of candidate splits at each node becomes very large: a factor with k levels admits 2^(k-1) - 1 distinct binary partitions, so roughly 2^31 for 32 levels. Why does rpart still work with a factor that has more levels?

Best Answer

Partially answered in comments:

I don't know the full reason, but CART uses a trick to reduce the number of splits considered. For regression, the levels of a categorical predictor are ordered by the mean of the outcome within each level; for binary responses, they are ordered by the proportion of outcomes in class 1. The predictor is then split as if it were ordered, which leaves only k - 1 candidate splits for k levels and still finds the optimal split (see The Elements of Statistical Learning for the reason). For responses with more than two classes, no such shortcut applies, and only approximations are available. I don't know why randomForest caps this at 32.

– Peter Calhoun
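
To make the ordering trick from the answer concrete, here is a minimal sketch in plain Python (not R, and not the actual rpart source). It compares an exhaustive search over all binary partitions of the levels with a search over levels sorted by their mean outcome, for a regression split scored by the sum of squared errors. The function names and the toy data are made up for illustration only.

```python
from itertools import combinations
from statistics import mean


def sse(values):
    """Sum of squared errors around the group mean (0 for an empty group)."""
    if not values:
        return 0.0
    m = mean(values)
    return sum((v - m) ** 2 for v in values)


def split_cost(data, left_levels):
    """Total SSE when levels in `left_levels` go left and all others go right."""
    left = [y for lvl, y in data if lvl in left_levels]
    right = [y for lvl, y in data if lvl not in left_levels]
    return sse(left) + sse(right)


def best_split_exhaustive(data):
    """Try every distinct binary partition of the levels: 2^(k-1) - 1 of them."""
    levels = sorted({lvl for lvl, _ in data})
    first, rest = levels[0], levels[1:]
    best_cost, n_tried = None, 0
    # Pin the first level to the left side so mirror-image partitions are not
    # counted twice; r < len(rest) keeps the right side non-empty.
    for r in range(len(rest)):
        for combo in combinations(rest, r):
            cost = split_cost(data, {first, *combo})
            n_tried += 1
            if best_cost is None or cost < best_cost:
                best_cost = cost
    return best_cost, n_tried


def best_split_ordered(data):
    """Sort levels by mean outcome, then try only the k - 1 ordered cut points."""
    levels = sorted({lvl for lvl, _ in data})
    level_mean = {lvl: mean(y for l, y in data if l == lvl) for lvl in levels}
    ordered = sorted(levels, key=level_mean.get)
    best_cost, n_tried = None, 0
    for i in range(1, len(ordered)):
        cost = split_cost(data, set(ordered[:i]))
        n_tried += 1
        if best_cost is None or cost < best_cost:
            best_cost = cost
    return best_cost, n_tried


# Hypothetical toy data: (factor level, numeric outcome) pairs with 5 levels.
toy_data = [
    ("a", 1.0), ("a", 2.0), ("b", 9.0), ("b", 10.0),
    ("c", 4.0), ("d", 8.5), ("d", 7.5), ("e", 1.5),
]

ex_cost, ex_n = best_split_exhaustive(toy_data)
ord_cost, ord_n = best_split_ordered(toy_data)
print("exhaustive:", ex_n, "splits tried, best SSE", ex_cost)
print("ordered:   ", ord_n, "splits tried, best SSE", ord_cost)
```

With 5 levels the exhaustive search evaluates 15 partitions and the ordered search evaluates 4, yet both reach the same minimum SSE. The gap grows exponentially with the number of levels, which is presumably why a package that relies on the ordering shortcut can accept factors with many more levels than one that enumerates partitions.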

For some alternative ideas see Random Forest Regression with sparse data in Python