Solved – Why use dummy variables in GBM with the caret library in R

Tags: boosting, caret, cart, r

I have seen a few examples on YouTube implementing the gbm algorithm on the Titanic dataset. These examples turned some factor variables into dummy/indicator variables, even though GBM can handle factor variables by internally creating dummies. I am working on an example with healthcare data, and I have ended up transforming some factor variables with fewer than 10 levels into dummy variables. I want to ask whether such a transformation can create a problem when it comes to classification accuracy.
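For concreteness, here is a minimal sketch of the kind of transformation I mean, using caret's dummyVars(); the data frame and column names are made up for illustration:

```r
library(caret)

# Hypothetical healthcare-style data: a factor with a handful of levels
df <- data.frame(
  age        = c(34, 51, 47, 29),
  admit_type = factor(c("elective", "emergency", "urgent", "elective"))
)

# dummyVars() builds one indicator column per factor level;
# fullRank = TRUE drops one level to avoid a redundant column
dv <- dummyVars(~ ., data = df, fullRank = TRUE)
df_dummies <- as.data.frame(predict(dv, newdata = df))
str(df_dummies)
```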

My other questions are:

  1. What is the benefit of using dummy variables with gbm compared to using a factor variable with fewer than 10 levels?
  2. Does anyone have any literature recommending, or arguing against, the use of dummy variables with GBM?

I would appreciate any help in this regard.

Thanks.

Best Answer

My experience across a number of data sets (some of which are documented in section 14.7 of APM) is that it doesn't change performance consistently in one direction (i.e., in some cases it is better, in others worse). I have yet to see a huge difference.

However, most tree-based models have an algorithm that, when given a categorical predictor, finds the optimal binary split. Many of these look at different configurations of how to split the categories (e.g., two levels on one side, three on the other). If you use dummy variables, the model only considers one value of that predictor at a time. Even though it has more predictors to sift through, I find that using dummy variables makes the training time shorter and the trees slightly deeper.
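As a rough sketch of how you could compare the two encodings yourself (the data here are made up, and the tuning is left at caret's defaults): caret's formula interface expands factors into dummy variables before fitting, while the x/y interface passes factors through so gbm splits on them internally.

```r
library(caret)
library(gbm)

# Hypothetical data: one numeric predictor, one 5-level factor, binary outcome
set.seed(1)
dat <- data.frame(
  x1 = rnorm(200),
  f1 = factor(sample(letters[1:5], 200, replace = TRUE)),
  y  = factor(sample(c("yes", "no"), 200, replace = TRUE))
)

ctrl <- trainControl(method = "cv", number = 5)

# Formula interface: caret expands f1 into dummy columns before fitting
set.seed(2)
fit_dummies <- train(y ~ ., data = dat, method = "gbm",
                     trControl = ctrl, verbose = FALSE)

# x/y interface: f1 stays a factor, so gbm groups levels at each split itself
set.seed(2)  # same seed so both models see the same CV folds
fit_factors <- train(x = dat[, c("x1", "f1")], y = dat$y,
                     method = "gbm", trControl = ctrl, verbose = FALSE)

# Compare resampled accuracy across the two encodings
summary(resamples(list(dummies = fit_dummies, factors = fit_factors)))
```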

Max
