Solved – Standardizing numerical and encoding of categorical data for training boosted decision tree

boosting, gradient

Is there a "best practice" way of standardizing numerical features and encoding categorical features for training a boosted decision tree, for both classification and regression problems?

Best Answer

As for standardization: tree-based models split on thresholds and are invariant to monotonic transformations of the features, so scaling or standardizing numeric features is unnecessary. For the categoricals, you could first pick a learner that supports categorical splits natively, such as the R gbm package (in contrast to xgboost).

Second, you could simply enumerate the categories as integers and treat them as numeric. This procedure works surprisingly well in practice. So if you prefer xgboost, you can just be lazy and convert/coerce your data.frame of mixed factors (categoricals) and numeric features into a numeric matrix and pass it to xgboost.
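A minimal sketch of this integer-enumeration idea (the function name and toy data are my own, not from any library):

```python
def label_encode(values):
    """Map each distinct category to an integer code, in order of first appearance."""
    codes = {}
    return [codes.setdefault(v, len(codes)) for v in values]

colors = ["red", "green", "red", "blue", "green"]
encoded = label_encode(colors)
# red -> 0, green -> 1, blue -> 2, so encoded == [0, 1, 0, 2, 1]
```

The resulting integer column can be placed in a numeric matrix alongside the other features; the tree learner will simply split on the (arbitrary) codes.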

Third, one-hot encoding: each category gets its own dummy variable, which is either zero or one. This method only allows one-vs-rest splits on the categorical, so I would try the first two options first.
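For illustration, a hand-rolled one-hot encoder (a sketch with a hypothetical function name; in practice you would use something like `model.matrix` in R or `pandas.get_dummies`):

```python
def one_hot_encode(values):
    """Expand a categorical column into one 0/1 dummy column per category.

    Returns the dummy matrix (one row per input value) and the sorted
    category names that label its columns.
    """
    categories = sorted(set(values))
    matrix = [[1 if v == c else 0 for c in categories] for v in values]
    return matrix, categories

matrix, cats = one_hot_encode(["red", "blue", "red"])
# cats == ["blue", "red"]; each row has exactly one 1
```

Note how a single split on any one dummy column can only separate that category from all the others, which is the one-vs-rest limitation mentioned above.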

Sometimes a feature has a very large number of categories. It is often not useful to plug such a high-cardinality feature into the model directly by any of these methods. It may be worth clustering the categories with k-means and/or cautiously binning the categories by their naively expected target value (use few bins, to avoid over-fitting).
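The binning-by-expected-target idea can be sketched as follows (names and data are hypothetical; to be cautious about leakage, in practice you would estimate the per-category means on held-out data rather than on the training fold itself):

```python
from collections import defaultdict

def target_mean_bins(categories, targets, n_bins=3):
    """Group categories into a few bins ranked by their mean target value.

    Few bins keep the encoding coarse, which reduces over-fitting risk
    for high-cardinality features.
    """
    sums, counts = defaultdict(float), defaultdict(int)
    for c, t in zip(categories, targets):
        sums[c] += t
        counts[c] += 1
    means = {c: sums[c] / counts[c] for c in sums}
    # Rank categories by mean target and cut the ranking into roughly equal bins.
    ranked = sorted(means, key=means.get)
    return {c: min(i * n_bins // len(ranked), n_bins - 1)
            for i, c in enumerate(ranked)}

bins = target_mean_bins(["a", "a", "b", "b", "c", "c"],
                        [0.0, 0.0, 1.0, 1.0, 0.5, 0.5], n_bins=3)
# category "a" (mean 0.0) -> bin 0, "c" (mean 0.5) -> bin 1, "b" (mean 1.0) -> bin 2
```

The resulting bin index is then used as a small ordinal feature in place of the raw category.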