Solved – Maximum number of classes for RandomForest multiclass estimation

classification, machine learning, multi-class, random forest

I have searched the literature extensively on multiclass prediction to find out what a realistic upper limit is on the number of classes that can successfully be estimated with a RandomForest method.

The text-mining literature sometimes reports very large numbers of classes (>1000), while most other "classical" cases described have fewer than 6-8 classes. Most of those papers describe handmade algorithms designed specifically for the particular problem, though, whereas I am interested in the performance of standard RF implementations (in R, for example).

I have even started to analyse simulated data to learn more, but the difficulty is generating data that has a large number of classes yet also has meaningful, realistic predictors.
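One way to get such simulated data is scikit-learn's `make_classification`, which builds balanced many-class problems with a controllable number of informative predictors. This is a minimal sketch, not a recommendation; all parameter values are illustrative, and the only hard constraint is that `n_classes * n_clusters_per_class` must not exceed `2**n_informative`.

```python
# Sketch: simulate a balanced 20-class problem with informative predictors
# and fit a standard random forest to it. All parameters are illustrative.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(
    n_samples=20_000,
    n_features=25,
    n_informative=8,        # 2**8 = 256 >= 20 classes * 1 cluster each
    n_classes=20,
    n_clusters_per_class=1,
    random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
print(rf.score(X_te, y_te))  # holdout accuracy; chance level is 1/20 = 0.05
```

Whether such hypercube-style synthetic classes are "realistic" is of course exactly the open question in the post; they are separable by construction.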

I know that the results depend largely on the number of observations in every class and the balance between class outcomes. For my data, I can safely assume that there will be enough observations per class, so that I can balance the data accordingly.

So I am curious whether people have applied standard RandomForest implementations to multiclass problems with a class count >>10. Note that I am not talking about separating the estimation into multiple one-vs-all problems.

Does anybody here have some real-life experience with that kind of data?

Best Answer

I have at least one experience doing so. For the NHTS 2017 dataset, I have modeled a number of variables. Notably, random forests perform quite well on predicting vehicle ownership per household (using most of the other household-level variables as features), somewhat outperforming logit models (which are, for whatever reason, state-of-the-art in travel modeling). There are a dozen classes here.

On the other hand, modeling individuals' work schedules (jointly, the hour they leave for work and the hour they leave work) involves a large number of combinations. After some data preprocessing, there are over 200 classes. Random forest models perform abysmally here in terms of accuracy: I get about 20% accuracy for an RF model with optimized max depth, versus almost 60% accuracy for a logistic regression. Interestingly, the log loss of the RF model is still lower than that of the logistic model.
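That accuracy/log-loss split is worth checking directly when comparing models, since accuracy only scores the argmax while log loss scores the full predicted probability distribution. The sketch below (synthetic data, not the NHTS dataset; all parameters are illustrative) computes both metrics for an RF and a logistic regression on a many-class problem:

```python
# Sketch: compare accuracy and log loss for a random forest vs. a
# multinomial logistic regression on a synthetic 50-class problem.
# The two metrics need not agree: log loss rewards calibrated
# probabilities even when the top-1 prediction is wrong.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, log_loss
from sklearn.model_selection import train_test_split

X, y = make_classification(
    n_samples=5_000, n_features=30, n_informative=10,
    n_classes=50, n_clusters_per_class=1, random_state=1,
)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1)

rf = RandomForestClassifier(n_estimators=100, max_depth=8,
                            random_state=1).fit(X_tr, y_tr)
lr = LogisticRegression(max_iter=2_000).fit(X_tr, y_tr)

for name, model in [("RF", rf), ("logit", lr)]:
    acc = accuracy_score(y_te, model.predict(X_te))
    ll = log_loss(y_te, model.predict_proba(X_te))
    print(f"{name}: accuracy={acc:.3f}, log loss={ll:.3f}")
```

On your own data, reproducing the pattern in the answer would mean the RF's accuracy line is lower but its log-loss line is lower too.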

These results ended up as an extended abstract at TRB. You can read the paper unpaywalled here.
