Solved – How to handle too many categorical features with too many categories for XGBoost

boosting, categorical data, categorical-encoding, classification, many-categories

My data has 35 features, 14 of which are categorical. Half of the categorical features have 3 to 4 categories, but the others have 14 to 28.

One-hot encoding them would produce a sparse matrix with many features that are correlated with one another.

Do you know how I can handle this problem?

Best Answer

There are several ways to tackle this, depending on your data and the cardinality of each feature:

  • After one-hot encoding, some of the new features may turn out to be almost always zero and to have negligible statistical significance; you can simply drop them.
  • Whole features (before encoding) may turn out to be insignificant.
  • For some of your categorical features, an ordering may actually make sense, such as "small, medium, big". In that case you can use a numerical encoding, which does not increase the number of features.
  • You can use binary encoding to reduce dimensionality. An already answered question deals with a somewhat similar topic: Binary Encoding vs One hot Encoding
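
To make the last two options concrete, here is a minimal sketch in plain Python. It is only an illustration, not a library implementation: the helper names ordinal_encode and binary_encode and the city_XX values are hypothetical. It shows that a feature with 28 categories needs 5 binary columns instead of 28 one-hot columns, and that an ordered feature stays a single column.

```python
def ordinal_encode(values, order):
    """Map ordered categories to integers, e.g. small < medium < big."""
    rank = {category: i for i, category in enumerate(order)}
    return [rank[v] for v in values]


def binary_encode(values):
    """Encode each category as the bits of its integer index.

    k distinct categories need about log2(k) columns instead of the
    k columns that one-hot encoding would create.
    """
    categories = sorted(set(values))
    index = {category: i for i, category in enumerate(categories)}
    # Number of bits needed to represent the largest index (at least 1).
    n_bits = max(1, (len(categories) - 1).bit_length())
    return [
        [(index[v] >> bit) & 1 for bit in reversed(range(n_bits))]
        for v in values
    ]


# Ordered feature: one numeric column, no extra features.
sizes = ["small", "big", "medium", "small"]
size_codes = ordinal_encode(sizes, order=["small", "medium", "big"])

# Hypothetical unordered feature with 28 categories (the largest
# cardinality mentioned in the question).
cities = [f"city_{i:02d}" for i in range(28)]
city_bits = binary_encode(cities)
n_binary_columns = len(city_bits[0])

print(size_codes)        # [0, 2, 1, 0]
print(n_binary_columns)  # 5
```

Keep in mind that binary codes share bits between unrelated categories, so a tree may need several splits to isolate a single category; whether that trade-off pays off depends on your data.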

Please refer to this great article for an in-depth analysis of different encoding schemes and their performance: https://medium.com/data-design/visiting-categorical-features-and-encoding-in-decision-trees-53400fa65931
