Solved – preprocessing and input format for gradient boosted trees

boosting, cart, categorical-data, data-preprocessing, spark-mllib

I'm using Spark's MLlib library and trying to run the Gradient-Boosted Trees algorithm on my data, which has mostly categorical features (and just two numerical ones). The example in the Spark documentation uses a LibSVM file, which contains only numerical values. My data, however, is a JSON file containing the categorical features; I can easily parse it into a Spark DataFrame, but not into LibSVM format. The algorithm's description, though, clearly says that

GBTs handle categorical features

along with

do not require feature scaling, and are able to capture non-linearities and feature interactions.

As I understand this introduction, I don't need to preprocess the categorical features by converting them into dummy variables, nor do I need to scale my numerical features or create additional ones to model interactions and non-linearity (PolynomialExpansion).

At the same time, the input to the algorithm is an RDD[LabeledPoint], where the features are a Vector[Double] as far as I know. So I don't quite understand how the algorithm will handle categorical data if it only accepts Double values as input.

I may be misunderstanding the introduction, or even the mechanism of the algorithm, so maybe I need to perform some hashing on the categorical features. But I can't find any documentation about preprocessing data for Gradient-Boosted Trees, so your help and advice would be appreciated.

Best Answer

It seems like your problem is with the programming side and not the statistics. Scaling is not needed for most tree-based algorithms, but you still need to pass the data format your library expects. If your tool requires doubles as input, convert your categorical variables to doubles.

For example, map a 3-level categorical variable with levels A, B, C to 0, 1, 2, or any numerical values you want.
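A minimal sketch of that mapping in Python. The helper name `encode` and the sample feature layout are illustrative, not from the original post; the `categoricalFeaturesInfo` parameter mentioned in the comments is how Spark MLlib's tree algorithms are told which encoded features are categorical:

```python
# Encode a 3-level categorical feature as doubles before building the
# feature vectors passed to Spark MLlib (e.g. as LabeledPoints).
levels = {"A": 0.0, "B": 1.0, "C": 2.0}

def encode(category, x1, x2):
    """Return the feature vector [category-as-double, x1, x2]."""
    return [levels[category], x1, x2]

# In pyspark you would then mark feature 0 as categorical with 3 levels
# when training, e.g.:
#   GradientBoostedTrees.trainClassifier(
#       data, categoricalFeaturesInfo={0: 3}, numIterations=10)
# so the trees split on subsets of levels rather than on a numeric
# threshold, which is how GBTs "handle categorical features" despite
# taking only doubles as input.
```

The key point is that the numeric codes are arbitrary labels, not magnitudes; it is the `categoricalFeaturesInfo` map that tells the algorithm to treat them as categories.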
