Machine Learning – Improve Classification with Many Categorical Variables

categorical data, classification, machine learning, many-categories, random forest

I'm working on a dataset with 200,000+ samples and approximately 50 features per sample: 10 continuous variables, and the other ~40 are categorical variables (countries, languages, scientific fields, etc.). These categorical variables have many levels: for example, 150 different countries, 50 languages, 50 scientific fields, and so on.

So far my approach is:

  1. For each categorical variable with many possible values, keep only the values that occur in more than 10,000 samples. This reduces it to 5-10 categories instead of 150.

  2. Build dummy variables for each categorical variable (if a variable has 10 countries, then for each sample add a binary vector of size 10).

  3. Feed this data to a random forest classifier (cross-validating the parameters, etc.). A rough sketch of this pipeline is shown after this list.
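
For reference, a minimal sketch of steps 1-3 in Python, assuming pandas/scikit-learn; the file name, target column, and categorical column names are hypothetical placeholders:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

df = pd.read_csv("data.csv")                          # hypothetical input file
y = df.pop("label")                                   # hypothetical target column
categorical_cols = ["country", "language", "field"]   # hypothetical names

# Step 1: keep only levels that occur in more than 10,000 samples,
# lumping everything else into an "other" bucket.
for col in categorical_cols:
    counts = df[col].value_counts()
    frequent = counts[counts > 10_000].index
    df[col] = df[col].where(df[col].isin(frequent), "other")

# Step 2: dummy-encode the reduced categorical variables.
X = pd.get_dummies(df, columns=categorical_cols)

# Step 3: random forest, evaluated with cross-validation.
rf = RandomForestClassifier(n_estimators=200, n_jobs=-1, random_state=0)
print(cross_val_score(rf, X, y, cv=5).mean())
```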

Currently, with this approach I only manage to get 65% accuracy, and I feel like more can be done. In particular I'm not satisfied with step 1, since I feel I shouldn't arbitrarily drop the "least relevant" values based only on how many samples they have; these less-represented values could be more discriminative. On the other hand, my RAM can't afford adding 500 columns * 200,000 rows to the data by keeping all possible values.
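
For scale: a 200,000 * 500 matrix of float64 dummies is roughly 800 MB, yet each row has only ~40 non-zero entries, so a sparse encoding is much smaller. A rough sketch of such a sparse encoding, assuming scikit-learn (whose random forest accepts scipy sparse input); column names are hypothetical:

```python
import pandas as pd
from scipy.sparse import hstack
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import OneHotEncoder

df = pd.read_csv("data.csv")                          # hypothetical input file
y = df.pop("label")                                   # hypothetical target column
categorical_cols = ["country", "language", "field"]   # hypothetical names
continuous_cols = [c for c in df.columns if c not in categorical_cols]

# Sparse one-hot encoding: only the non-zero entries are stored, so keeping
# all ~500 levels stays well within RAM.
enc = OneHotEncoder(handle_unknown="ignore")          # sparse output by default
X = hstack([df[continuous_cols].values,
            enc.fit_transform(df[categorical_cols])]).tocsr()

# scikit-learn's random forest accepts scipy sparse matrices directly.
rf = RandomForestClassifier(n_estimators=200, n_jobs=-1, random_state=0)
rf.fit(X, y)
```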

Would you have any suggestions for coping with this many categorical variables?

Best Answer

  1. Random forests should be able to handle categorical values natively, so look for a different implementation; that way you don't have to encode all of those features and use up all your memory. (A sketch with one such implementation follows after this list.)

  2. The problem with high-cardinality categorical features is that it is easy to overfit with them. You may have enough data that this isn't an issue, but watch out for it. (A quick check is sketched after the list.)

  3. I suggest looking into random-forest-based feature selection, using either the method Breiman proposed or artificial contrasts. The artificial contrasts method (ACE) is interesting because it compares the importance of a feature to the importance of a shuffled version of itself, which counteracts some of the high-cardinality issues. (A sketch of the shuffled-copy idea follows the list.) There is a new paper, "Module Guided Random Forests", which might be interesting if you had many more features, as it uses a feature selection method that is aware of groups of highly correlated features.

  4. Another sometimes-used option is to tweak the algorithm so that it uses the out-of-bag cases to do the final feature selection after fitting the splits on the in-bag cases, which sometimes helps fight overfitting.
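
For point 1, a minimal sketch using a library that handles categorical features natively; this assumes LightGBM run in its random-forest mode, and the column names are hypothetical, so treat it as an illustration rather than the only option:

```python
import lightgbm as lgb
import pandas as pd
from sklearn.model_selection import cross_val_score

df = pd.read_csv("data.csv")                          # hypothetical input file
y = df.pop("label")                                   # hypothetical target column
categorical_cols = ["country", "language", "field"]   # hypothetical names

# Mark the categoricals with pandas' category dtype; LightGBM then handles
# all levels natively, with no dummy columns and no level pruning.
for col in categorical_cols:
    df[col] = df[col].astype("category")

# boosting_type="rf" runs LightGBM as a bagged (random-forest-style) ensemble;
# this mode requires bagging to be enabled via bagging_fraction/bagging_freq.
clf = lgb.LGBMClassifier(
    boosting_type="rf",
    n_estimators=200,
    bagging_fraction=0.8,
    bagging_freq=1,
    feature_fraction=0.8,
)
print(cross_val_score(clf, df, y, cv=5).mean())
```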
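
For point 2, one simple way to watch for overfitting with scikit-learn is to compare training accuracy against the out-of-bag estimate; X and y below stand for the encoded feature matrix and labels from the question's pipeline:

```python
from sklearn.ensemble import RandomForestClassifier

# oob_score=True scores each sample only with trees that did not see it in
# their bootstrap sample, so a large gap between training accuracy and the
# OOB accuracy is a warning sign of overfitting.
rf = RandomForestClassifier(n_estimators=200, oob_score=True, n_jobs=-1,
                            random_state=0)
rf.fit(X, y)                      # X, y: encoded features and labels (as above)
print("train accuracy:", rf.score(X, y))
print("OOB accuracy:  ", rf.oob_score_)
```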
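
For point 3, a simplified sketch of the artificial-contrast idea (not the published ACE implementation): append a shuffled copy of each feature and keep only features whose importance beats their own shuffled "shadow":

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

def artificial_contrast_selection(X, y, n_estimators=200, random_state=0):
    """Keep features whose importance exceeds that of their shuffled copy.

    X: DataFrame of (already encoded) features, y: labels. This is a
    simplified sketch of the idea, not the original ACE algorithm.
    """
    rng = np.random.RandomState(random_state)

    # Shuffled copies break any relationship with y while keeping each
    # feature's marginal distribution (and cardinality).
    shadows = X.copy()
    for col in shadows.columns:
        shadows[col] = rng.permutation(shadows[col].values)
    shadows.columns = ["shadow_" + c for c in X.columns]

    rf = RandomForestClassifier(n_estimators=n_estimators, n_jobs=-1,
                                random_state=random_state)
    rf.fit(pd.concat([X, shadows], axis=1), y)

    imp = pd.Series(rf.feature_importances_,
                    index=list(X.columns) + list(shadows.columns))
    return [c for c in X.columns if imp[c] > imp["shadow_" + c]]

# Usage (X, y as in the question's pipeline):
# selected = artificial_contrast_selection(X, y)
```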

There is an almost complete ACE implementation here, and I have a more memory-efficient/fast RF implementation that handles categorical variables natively here; its -evaloob option supports option 4. I'm working on adding support for ACE and a couple of other RF-based feature selection methods, but it isn't done yet.
