Solved – GBM: Predict the response variable measured in {0,20}

boosting, feature selection, r

I need to predict a response that takes values in {0,20}. Should it be treated as a factor or as a numeric value? How does that choice affect the prediction error?

I am using GBM with the Gaussian distribution to predict this variable, and the accuracy is very low.

gbm_model <- gbm(target~., data=traindf, distribution = "gaussian", n.trees = 500, 
                 bag.fraction = 0.75, cv.folds = 5, interaction.depth = 3)

For predicting I am using this code:

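# Locate the response column so it can be dropped from the predictors
response_column <- which(colnames(testdf) == "target")
# Predict on the test set without the response column, using all 500 trees
predictions_gbm <- predict(gbm_model, newdata = testdf[, -response_column], 
                           n.trees = 500, type = "response")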

Best Answer

Classifying cases into 21 different classes is hard, so when your response variable is an integer in 0 … 20, you probably don't want to just convert it to a factor. It's hard to give more concrete advice than that without knowing the distribution of the data, where it comes from, or what it represents. You can try transforming the response variable (for example, adding 1 and taking the logarithm) or cutting it into a few discrete categories (like 0 through 10 and 11 through 20). Decisions about how to code the response variable should be made with reference to its meaning (is 11 a sensible threshold?), its distribution (try not to create categories with only a few training samples), and your model.
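As a rough illustration of the two recoding options mentioned above, here is a minimal sketch in base R. The vector `target` is made-up example data, and the threshold of 10 is only the one used as an example in the answer, not a recommendation.

```r
# Hypothetical example data: an integer response in 0..20
target <- c(0, 3, 7, 10, 11, 15, 20)

# Option 1: add 1 and take the logarithm; log1p(x) computes log(1 + x)
target_log <- log1p(target)

# Predictions made on the log scale can be mapped back with expm1()
target_back <- expm1(target_log)

# Option 2: cut into two categories, 0 through 10 and 11 through 20
# (intervals are right-closed by default, so 10 falls in the first bin)
target_bin <- cut(target, breaks = c(-Inf, 10, 20),
                  labels = c("0-10", "11-20"))

# Check that no category has only a few samples
bin_counts <- table(target_bin)
print(bin_counts)
```

If the binned version is used, the model would be fit as a classifier instead of with the Gaussian distribution; which option is appropriate depends on what the response means.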

In any case, remember that you can clip your predictions to [0, 20]; there's no sense in predicting something outside the range of the response variable.
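Clipping can be done in one line with base R's `pmin` and `pmax`. The sketch below uses made-up prediction values in place of the real output of `predict`; in practice it would be applied to `predictions_gbm` from the question.

```r
# Hypothetical raw predictions from a Gaussian GBM;
# some fall outside the range of the response
predictions_gbm <- c(-1.3, 4.2, 19.8, 22.5)

# Clip to [0, 20]: pmax raises values below 0, pmin lowers values above 20
predictions_clipped <- pmin(pmax(predictions_gbm, 0), 20)
print(predictions_clipped)
```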