This is common in NLP: manipulating very high dimensional feature vectors. In a very simple case of text classification, each dimension corresponds to a word (or a bigram).
Feature selection, taking again the text classification task, can be done very naturally in logistic regression with an $l_1$-norm penalty. You control the strength of the regularization, and the algorithm automatically prunes away features that don't contribute to the classification task (by setting their coefficients to 0).
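For instance, with scikit-learn (a minimal sketch; the corpus, labels, and the value of `C` are toy choices for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Toy corpus and labels, purely illustrative.
texts = ["good movie", "bad movie", "great film", "awful film"]
labels = [1, 0, 1, 0]

# Bag-of-words features: one dimension per word.
X = CountVectorizer().fit_transform(texts)

# The l1 penalty drives uninformative coefficients exactly to 0.
# C controls the regularization strength (smaller C = more pruning).
clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.5)
clf.fit(X, labels)

# Surviving features are those with non-zero coefficients.
print((clf.coef_ != 0).sum(), "features kept out of", X.shape[1])
```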
In the Bayesian setting, I believe feature selection is done using sparsity-inducing priors such as the Laplace prior, the Bayesian counterpart of the $l_1$ penalty. Unfortunately I am not very proficient in this field.
Having some custom rules to prune away features is nice and instructive, but for a real problem, let a well-tested learning algorithm do this for you.
If one categorical variable has high cardinality, wouldn't encoding it this way "overpower" other (for example, binary) variables?
It depends on the algorithm.
Algorithms based on column sampling (random forests, extremely randomized trees, gradient boosting, or a bagged classifier...) train many models on subsamples of the data. If 90% of your columns represent a single "dummified" variable, it is likely that a large number of those models are actually working on the same variable, making them more correlated than they should be and thus harming performance.
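To see the imbalance concretely, here is a hedged sketch with pandas (the column names and sizes are made up):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Hypothetical data: one high-cardinality nominal column, one binary column.
df = pd.DataFrame({
    "city": rng.choice([f"city_{i}" for i in range(100)], size=1000),
    "is_member": rng.integers(0, 2, size=1000),
})

dummies = pd.get_dummies(df, columns=["city"])
# About 100 of the ~101 columns now encode the same underlying variable,
# so uniform column sampling will pick a "city" column almost every time.
print(dummies.shape)  # about (1000, 101): 100 dummy columns + is_member
```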
Linear regression methods will not be affected; they simply assign a weight to each binary column produced by the encoding.
With nearest neighbours and similarity-based methods (such as kernel SVMs), the impact should be limited as well. No matter the number of columns, the only thing that matters in the end is the inner product or the distance between two rows of your data. However many columns stem from a nominal variable, that variable's contribution to the inner product can only be 0 or 1 (the nominal values were equal or not), so its influence does not grow with its cardinality.
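A quick numpy illustration of this point (the one-hot vectors are toy examples for a 4-level variable):

```python
import numpy as np

# One-hot encodings of a hypothetical 4-level nominal variable.
red  = np.array([1, 0, 0, 0])
blue = np.array([0, 1, 0, 0])
red2 = np.array([1, 0, 0, 0])

# The inner product contributed by the variable is 1 iff the values match.
print(red @ red2)  # 1  (same category)
print(red @ blue)  # 0  (different categories)

# Likewise the squared Euclidean contribution is a constant (0 or 2),
# regardless of how many levels the variable has.
print(((red - blue) ** 2).sum())  # 2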
If our classifier model is aware of relationships between variables, wouldn't it unnecessarily attempt to find relationships between introduced binary "components" of the same variable?
How is your classifier "aware" of relationships between variables? I am not sure I can address this question.
And if so, how could this be addressed?
In the case of any method relying on column sampling, prior weights could be given to the columns (so that they are not all selected with the same probability). However, I do not have any implementation in mind that does this. A quick fix could be to repeat the other columns, so that their likelihood of being selected artificially increases.
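A sketch of that quick fix (the helper name, column layout, and repetition factor are all made up for illustration):

```python
import numpy as np

def rebalance_columns(X, dummy_cols, other_cols, factor):
    """Hypothetical quick fix: duplicate the non-dummy columns `factor`
    times so that a uniform column sampler picks them more often."""
    repeated = np.repeat(X[:, other_cols], factor, axis=1)
    return np.hstack([X[:, dummy_cols], repeated])

# Suppose X has 100 dummy columns (0..99) and 2 ordinary columns (100, 101).
X = np.random.rand(500, 102)
X_balanced = rebalance_columns(X, list(range(100)), [100, 101], factor=50)
print(X_balanced.shape)  # (500, 200): dummies and other columns now 50/50
```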
Best Answer
There are possibly many ways to tackle this, depending on your data, feature cardinality, etc.
For an in-depth analysis of different encoding schemes and their performance, please refer to this great article: https://medium.com/data-design/visiting-categorical-features-and-encoding-in-decision-trees-53400fa65931