Solved – Why use dummy variables in GBM with the caret library in R

Tags: boosting, caret, cart, r

I have seen a few examples on YouTube implementing the gbm algorithm on the Titanic dataset. These examples turned some factor variables into dummy/indicator variables, even though GBM can handle factor variables by internally creating dummies. I am working on an example with healthcare data, and I have ended up transforming some factor variables with fewer than 10 levels into dummy variables. I want to ask whether such a transformation can create a problem when it comes to classification accuracy.
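For concreteness, here is a minimal sketch of the kind of transformation I mean, using caret's dummyVars(); the data frame and column names are made up for illustration:

```r
library(caret)

# Hypothetical healthcare-style data: a factor with a handful of levels
df <- data.frame(
  age        = c(34, 51, 47, 29),
  admit_type = factor(c("elective", "emergency", "urgent", "elective"))
)

# dummyVars() builds one indicator column per factor level;
# fullRank = TRUE drops one level to avoid a redundant column
dv <- dummyVars(~ ., data = df, fullRank = TRUE)
df_dummies <- as.data.frame(predict(dv, newdata = df))
str(df_dummies)
```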

My other questions are:

  1. What is the benefit of using dummy variables with gbm compared to using a factor variable with fewer than 10 levels?
  2. Does anyone have any literature recommending, or arguing against, the use of dummy variables with GBM?

I would appreciate any help in this regard.

Thanks.

Best Answer

My experience across a number of data sets (some of which are documented in section 14.7 of APM) is that it doesn't change performance consistently in one direction (i.e., in some cases it is better, in others worse). I have yet to see a huge difference.

However, most tree-based models have an algorithm that, when given a categorical predictor, finds the optimal binary split. Many of these look at different configurations of how to split the categories (e.g., two levels on one side, three on the other). If you use dummy variables, the model only considers one value of that predictor at a time. Even though it has more predictors to sift through, I find that using dummy variables makes the training time shorter and the trees slightly deeper.
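As a rough sketch of how you could compare the two encodings yourself (the data here are made up, and the tuning is left at caret's defaults): caret's formula interface expands factors into dummy variables before fitting, while the x/y interface passes factors through so gbm splits on them internally.

```r
library(caret)
library(gbm)

# Hypothetical data: one numeric predictor, one 5-level factor, binary outcome
set.seed(1)
dat <- data.frame(
  x1 = rnorm(200),
  f1 = factor(sample(letters[1:5], 200, replace = TRUE)),
  y  = factor(sample(c("yes", "no"), 200, replace = TRUE))
)

ctrl <- trainControl(method = "cv", number = 5)

# Formula interface: caret expands f1 into dummy columns before fitting
set.seed(2)
fit_dummies <- train(y ~ ., data = dat, method = "gbm",
                     trControl = ctrl, verbose = FALSE)

# x/y interface: f1 stays a factor, so gbm groups levels at each split itself
set.seed(2)  # same seed so both models see the same CV folds
fit_factors <- train(x = dat[, c("x1", "f1")], y = dat$y,
                     method = "gbm", trControl = ctrl, verbose = FALSE)

# Compare resampled accuracy across the two encodings
summary(resamples(list(dummies = fit_dummies, factors = fit_factors)))
```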

Max
