R Random Forest Limitation – Workaround for Handling More Than 32 Levels

many-categories, r, random-forest

R's randomForest package cannot handle factors with more than 32 levels. When it is given one, it emits the error message:

Can not handle categorical predictors with more than 32 categories.

But several factors in my data have many levels: some have 1000+ levels, others 100+. One of them is the US 'state' variable, which has 52 levels.

So, here's my question.

  1. Why is there such a limitation? randomForest refuses to run even for this simple case:

    > d <- data.frame(x=factor(1:50), y=1:50)
    > randomForest(y ~ x, data=d)
      Error in randomForest.default(m, y, ...) : 
      Can not handle categorical predictors with more than 32 categories.
    

    If it is simply a memory limitation, how can scikit-learn's RandomForestRegressor run with more than 32 levels?

  2. What is the best way to handle this problem? Suppose I have independent variables X1, X2, …, X50 and a dependent variable Y. And suppose that X1, X2, and X3 have more than 32 levels. What should I do?

    What I'm thinking of is running a clustering algorithm on each of X1, X2, and X3, with distance defined as the difference in Y. I'd run three clusterings, one per problematic variable, and in each clustering I hope to find similar levels, which I would then merge.

    How does this sound?
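To make the merging idea above concrete, here is a minimal pure-Python sketch (function and variable names are my own, purely illustrative): compute each level's mean outcome, sort levels by that mean, and cut them into a small number of buckets.

```python
from collections import defaultdict

def merge_levels_by_outcome(x, y, n_buckets=4):
    """Map each level of factor x to one of n_buckets groups,
    grouping levels whose mean outcome y is similar."""
    sums, counts = defaultdict(float), defaultdict(int)
    for level, outcome in zip(x, y):
        sums[level] += outcome
        counts[level] += 1
    means = {level: sums[level] / counts[level] for level in sums}
    ordered = sorted(means, key=means.get)        # levels sorted by mean y
    size = max(1, len(ordered) // n_buckets)      # roughly equal bucket sizes
    return {level: min(i // size, n_buckets - 1)
            for i, level in enumerate(ordered)}

# Example: four levels, outcomes 1..4, merged into two buckets
print(merge_levels_by_outcome(["a", "b", "c", "d"], [1, 2, 3, 4], n_buckets=2))
```

Note that this uses the outcome Y to build the grouping, which is exactly the information leak the answer below warns about.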

Best Answer

It is actually a pretty reasonable constraint, because a split on a factor with $N$ levels is actually a selection of one of the $2^N-2$ possible combinations of levels. So even with $N$ around 25, the space of combinations is so huge that such inference makes little sense.
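A quick calculation shows how fast this blows up (the function name here is just for illustration):

```python
# Each nonempty proper subset of the levels is a candidate left/right
# assignment at a single split, giving 2**N - 2 combinations.
def candidate_splits(n_levels: int) -> int:
    return 2**n_levels - 2

for n in (10, 25, 32):
    print(n, candidate_splits(n))
```

Already at $N = 25$ there are over 33 million candidate splits per node, and at $N = 32$ over 4 billion.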

Most other implementations simply treat the factor as ordinal (i.e., integers from 1 to $N$), and this is one option for working around the problem. In fact, RF is often wise enough to slice the levels into arbitrary groups with several splits.
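Conceptually, the ordinal treatment amounts to mapping each level to an integer code, so a tree can split on "code ≤ threshold". A minimal sketch in pure Python (in practice pandas' `Series.cat.codes` or scikit-learn's `OrdinalEncoder` does this for you):

```python
def ordinal_encode(values):
    """Map each distinct level to an integer code, in sorted level order."""
    levels = sorted(set(values))                     # deterministic ordering
    code = {level: i for i, level in enumerate(levels)}
    return [code[v] for v in values], code

states = ["TX", "CA", "NY", "CA", "TX"]
codes, mapping = ordinal_encode(states)
print(codes)    # [2, 0, 1, 0, 2] with CA=0, NY=1, TX=2
```

The code order is arbitrary with respect to the outcome, which is why the forest may need several splits to isolate one group of levels.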

The other option is to change the representation -- maybe your outcome does not directly depend on the state as an entity but rather on, for instance, its area, population, number of pine trees per capita, or other attributes you can plug into your information system instead.
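As a sketch of that representation change, you can replace the raw `state` column with numeric attributes it proxies for (the feature names and values below are purely illustrative):

```python
# Hypothetical lookup table: per-state numeric attributes.
state_features = {
    "TX": {"population_m": 30.0, "median_income_k": 67.0},
    "CA": {"population_m": 39.0, "median_income_k": 84.0},
}

def expand_state(rows):
    """Replace the 'state' key in each row dict with numeric features."""
    out = []
    for row in rows:
        new = {k: v for k, v in row.items() if k != "state"}
        new.update(state_features[row["state"]])
        out.append(new)
    return out

print(expand_state([{"state": "TX", "x": 1}]))
```

The resulting columns are ordinary numeric predictors, so the 32-level restriction no longer applies to them.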

It may also be that each state is such an isolated and uncorrelated entity that it requires a separate model of its own.

Clustering levels based on the decision is probably a bad idea, because this way you are smuggling information from the decision into the attributes, which often ends in overfitting.