Solved – Handling (low-cardinality) categorical features in gradient boosting libraries

boosting, catboost, categorical-data, categorical-encoding

In some popular gradient boosting libraries (lgb, catboost), it seems categorical inputs can be handled natively: you just specify the column names of the categorical features and pass them to the fit call or model instance via categorical_feature=.

(The exception seems to be xgboost, whose documentation notes:

Categorical features not supported

Note that XGBoost does not support categorical features; if your data contains categorical features, load it as a NumPy array first and then perform one-hot encoding.)
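For concreteness, here is a minimal sketch of what I mean, on made-up toy data (note that CatBoost's corresponding argument is spelled cat_features rather than categorical_feature):

```python
import pandas as pd
import lightgbm as lgb
from catboost import CatBoostClassifier

# Toy data with one categorical and one numeric column (illustrative only)
X = pd.DataFrame({
    "color": pd.Categorical(["red", "blue", "green", "blue", "red", "green", "blue", "red"]),
    "size": [1.0, 2.5, 3.2, 0.7, 1.8, 2.9, 0.4, 1.1],
})
y = [0, 1, 1, 0, 0, 1, 0, 1]

# LightGBM: declare the categorical column by name when fitting
lgb_model = lgb.LGBMClassifier(min_child_samples=1)
lgb_model.fit(X, y, categorical_feature=["color"])

# CatBoost: the analogous argument is cat_features
cb_model = CatBoostClassifier(iterations=10, verbose=False)
cb_model.fit(X, y, cat_features=["color"])

# XGBoost (per the note quoted above): one-hot encode first
X_ohe = pd.get_dummies(X, columns=["color"])
```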

When the cardinality is high, I can intuitively see the advantage of tricks for handling it (e.g. mean encoding) over one-hot encoding, since one-hot encoding would just produce a huge sparse matrix, and the model would probably struggle to learn well without growing a very deep tree.
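To illustrate that intuition, here is a rough, toy-data sketch of mean (target) encoding versus one-hot encoding for a single column; in real use the mean encoding would have to be computed out-of-fold to avoid target leakage:

```python
import pandas as pd

# Toy stand-in for a high-cardinality column (made-up data)
df = pd.DataFrame({
    "city": ["paris", "tokyo", "paris", "oslo", "tokyo", "oslo"],
    "target": [1, 0, 1, 0, 1, 0],
})

# Mean (target) encoding: a single dense numeric column, whatever the cardinality
city_means = df.groupby("city")["target"].mean()
df["city_mean_enc"] = df["city"].map(city_means)

# One-hot encoding: one column per category, i.e. a wide sparse matrix
# once the cardinality gets high
ohe = pd.get_dummies(df["city"], prefix="city")
```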

My questions are:

1) Why doesn't xgboost handle categorical features natively? Is it because the advantages of all these manipulations in lgb and catboost are actually not significant in practice?

2) For lgb and catboost, when the cardinality of the categorical features is low, is it better to still pass the categorical columns via categorical_feature=, or to use one-hot encoding (since in this case there won't be a giant sparse matrix)?

Best Answer

  1. The XGBoost implementation of GBM does not handle categorical features natively because it did not have to. The methodological breakthrough of XGBoost was its use of Hessian information: where other implementations (e.g. sklearn in Python, gbm in R) used just gradients, XGBoost also used second-order (Hessian) information when boosting. Simply put, it obliterated them in terms of speed, and handling categorical variables was an afterthought. LightGBM and CatBoost build on the work of XGBoost and focus primarily on the handling of categorical features and on growing "smarter" trees. Especially for CatBoost, which is developed mainly by Yandex, an Internet search provider, the ability to work efficiently with very high cardinality features (e.g. query types) is crucial functionality.

  2. This is completely application specific. Anecdotally, I have seen Kaggle threads where users complained about performance degradation when using categorical features, and I have seen Kaggle threads where users raved about a performance boost when using them. In terms of performance, other aspects of the model-fitting procedure (e.g. how to objectively measure a model's performance and/or how to avoid over-fitting) have far greater influence. The general rule is that numerical encoding (and the subsequent binning) of categorical-turned-numerical features leads to speed-ups. I have come across an investigation of the behaviour of decision trees under different encoding schemes here.

    For low-cardinality features, numerical encoding should make no real difference; binary features are the extreme case where there is no difference at all. The main thing gained by avoiding one-hot encoding (OHE) is avoiding very deep and unbalanced trees, and with a low-cardinality feature this is mostly irrelevant, so the choice between OHE and numerical encoding is largely a matter of convenience. Obviously, one-hot encoding (minus a reference level) should be used if we want a factorial design and to test a particular hypothesis.
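To make that last point concrete, here is a minimal, pandas-only sketch (toy data; column names are made up) contrasting plain numerical codes, which the libraries then bin internally, with one-hot encoding that drops a reference level:

```python
import pandas as pd

# A low-cardinality (three-level) factor
df = pd.DataFrame({"treatment": ["a", "b", "c", "a", "b", "c"]})

# Numerical encoding: integer codes, which tree libraries then bin internally
df["treatment_code"] = df["treatment"].astype("category").cat.codes

# One-hot encoding minus a reference level (drop_first=True), i.e. the
# parameterisation you would use for a factorial design
design = pd.get_dummies(df["treatment"], prefix="treatment", drop_first=True)
```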