Solved – GBM: Predict the response variable measured in {0,20}

boosting, feature selection, r

I need to predict a response that takes values in {0,20}. Should it be treated as a factor or as a numeric value? How does that choice affect the prediction error?

I am using GBM with the Gaussian distribution to predict this variable, and the accuracy is very low.

library(gbm)

gbm_model <- gbm(target ~ ., data = traindf, distribution = "gaussian", n.trees = 500,
                 bag.fraction = 0.75, cv.folds = 5, interaction.depth = 3)

For predicting I am using this code:

response_column <- which(colnames(testdf) == "target")
predictions_gbm <- predict(gbm_model, newdata = testdf[, -response_column],
                           n.trees = 500, type = "response")

Best Answer

Classifying cases into 21 different classes is hard, so when your response variable is an integer in 0 … 20, you probably don't want to just convert it to a factor. It's hard to give more concrete advice than that without knowing much about the distribution of the data or where the data comes from or what the data is about. You can try transformations of the response variable (such as adding 1 and taking the logarithm) or cutting it into a few discrete categories (like 0 through 10 and 11 through 20). Decisions about how to code the response variable should be made with reference to its meaning (is 11 a sensible threshold?), its distribution (try not to create categories with only a few training samples), and your model.
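The two recodings mentioned above can be sketched in base R. This is only an illustration: the `target` vector is made up, and the 10/11 split is the example threshold from the text, not a recommendation for any particular data set.

```r
# Hypothetical response values in 0..20 (made up for illustration)
target <- c(0, 3, 7, 10, 11, 15, 20)

# Option 1: add 1 and take the logarithm; expm1() is the exact inverse,
# so predictions made on the log scale can be mapped back afterwards
target_log <- log1p(target)
back <- expm1(target_log)

# Option 2: cut into two categories, 0 through 10 and 11 through 20
target_bin <- cut(target, breaks = c(-Inf, 10, 20),
                  labels = c("0-10", "11-20"))

# Check how many training samples land in each category
table(target_bin)
```

If the model is trained on `target_log`, its predictions have to be passed through `expm1()` before they are compared with the original response.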

In any case, remember that you can clip your predictions to [0, 20]; there's no sense in predicting something outside the range of the response variable.
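Clipping can be done with `pmax()` and `pmin()` from base R. A minimal sketch, where `raw_pred` is a made-up stand-in for whatever `predict()` returns:

```r
# Hypothetical raw predictions from a regression model (made up for illustration)
raw_pred <- c(-1.7, 0.4, 12.3, 20.0, 23.9)

# Raise anything below 0 to 0, then lower anything above 20 to 20
clipped <- pmin(pmax(raw_pred, 0), 20)
clipped
```

Clipping can only move an out-of-range prediction closer to the true value, so it never increases the squared error of an individual prediction.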

Related Solutions

Solved – Why does GBM predict different values for the same data

The factors are the culprit, as always. It seems that the model is not using the actual value of the factor, but rather something like its position in the factor levels.

I was able to reproduce your error with the OrchardSprays data set:

library(gbm)
data(OrchardSprays)

model <- gbm(decrease ~ rowpos + colpos + treatment, data = OrchardSprays,
             n.trees = 1000, distribution = "gaussian", interaction.depth = 3,
             bag.fraction = 0.5, train.fraction = 1.0, shrinkage = 0.1,
             keep.data = TRUE)

firstrow <- OrchardSprays[1,]
str(firstrow)

manualFirstrow <- data.frame(decrease=57,rowpos=1,colpos=1,treatment="D")
str(manualFirstrow)

predict(model,newdata=firstrow,n.trees=100)
predict(model,newdata=manualFirstrow,n.trees=100)
predict(model,newdata=data.frame(decrease=57,rowpos=1,colpos=1,treatment="A"),n.trees=100)

output:

> predict(model,newdata=firstrow,n.trees=100)
[1] 50.31276
> predict(model,newdata=manualFirstrow,n.trees=100)
[1] 20.67818
> predict(model,newdata=data.frame(decrease=57,rowpos=1,colpos=1,treatment="A"),n.trees=100)
[1] 20.67818

The last two calls return the same value because A has position 1 in the levels of OrchardSprays$treatment. Adding the levels to the data declaration does the trick:

manualFirstrow <- data.frame(decrease=57,rowpos=1,colpos=1,treatment=factor("D",levels(OrchardSprays$treatment)))
str(manualFirstrow)

predict(model,newdata=firstrow,n.trees=100)
predict(model,newdata=manualFirstrow,n.trees=100)

output:

> predict(model,newdata=firstrow,n.trees=100)
[1] 50.31276
> predict(model,newdata=manualFirstrow,n.trees=100)
[1] 50.31276
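The difference between the two ways of building the factor can be seen without fitting a model. The sketch below only compares the integer codes stored inside each factor; that gbm relies on those codes is the hypothesis of this answer, not something the sketch proves. The names `lonely` and `aligned` are made up for illustration.

```r
data(OrchardSprays)
full_levels <- levels(OrchardSprays$treatment)  # the levels seen in training

# A factor built from the bare string has "D" as its only level
lonely <- factor("D")

# A factor built with the training levels keeps "D" in its original position
aligned <- factor("D", levels = full_levels)

as.integer(lonely)   # position of "D" among its own levels
as.integer(aligned)  # position of "D" among the training levels
```

The first code is 1, the same position that "A" holds in the training levels, while the second is 4.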

Solved – Predict with type=’response’ for GLM with errorest() function in package ipred

You never specified that the GLM that you want to run should be the binomial logit. You will need to wrap that logic in your own function as well, as shown below:

data(mtcars)
require(ipred)

# user defined model function
myGLM = function(formula, data) {
  glm(formula, data, family = binomial(link = logit))
}

# user-defined prediction function
myPredictGLM = function(object, newdata){
  predict(object, newdata , type="response")
}

# run errorest with user-defined modeling and prediction functions
logitErrEst = errorest(am ~ mpg + hp, data=mtcars, model=myGLM, 
                    predict=myPredictGLM, estimator="cv", 
                    est.para=control.errorest(k=5, predictions = TRUE))

# check
summary(logitErrEst$predictions)
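As a quick check of what type="response" returns for a binomial logit model, the following sketch uses only base R and the same formula as above. It compares the default link-scale predictions (log-odds) with the response-scale predictions (probabilities); the variable names are made up for illustration.

```r
data(mtcars)

# Same model as myGLM above, fitted once on the full data
fit <- glm(am ~ mpg + hp, data = mtcars, family = binomial(link = "logit"))

# predict() defaults to the link scale, i.e. log-odds
log_odds <- predict(fit, newdata = mtcars, type = "link")

# type = "response" applies the inverse logit and returns probabilities
prob <- predict(fit, newdata = mtcars, type = "response")

# plogis() is the inverse logit, so the two should agree
all.equal(unname(plogis(log_odds)), unname(prob))
range(prob)
```

Without the family argument, glm() falls back to its default gaussian family, and the fitted values are no longer constrained to lie between 0 and 1.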
