The factors, as always. The model does not seem to use the actual value of the factor, but rather its position in the factor levels.
I was able to reproduce your error with the OrchardSprays data:
library(gbm)
data(OrchardSprays)
model <- gbm(decrease ~ rowpos+colpos+treatment, data=OrchardSprays, n.trees=1000, distribution="gaussian", interaction.depth=3, bag.fraction=0.5, train.fraction=1.0, shrinkage=0.1, keep.data=TRUE)
firstrow <- OrchardSprays[1,]
str(firstrow)
manualFirstrow <- data.frame(decrease=57,rowpos=1,colpos=1,treatment="D")
str(manualFirstrow)
predict(model,newdata=firstrow,n.trees=100)
predict(model,newdata=manualFirstrow,n.trees=100)
predict(model,newdata=data.frame(decrease=57,rowpos=1,colpos=1,treatment="A"),n.trees=100)
output:
> predict(model,newdata=firstrow,n.trees=100)
[1] 50.31276
> predict(model,newdata=manualFirstrow,n.trees=100)
[1] 20.67818
> predict(model,newdata=data.frame(decrease=57,rowpos=1,colpos=1,treatment="A"),n.trees=100)
[1] 20.67818
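Both manual rows give the same prediction because A has position 1 in the levels of OrchardSprays$treatment: a factor created from the single string "D" has only one level, so "D" is stored as position 1, and the model apparently reads that as A. Adding the levels to the data declaration does the trick: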
manualFirstrow <- data.frame(decrease=57,rowpos=1,colpos=1,treatment=factor("D",levels(OrchardSprays$treatment)))
str(manualFirstrow)
predict(model,newdata=firstrow,n.trees=100)
predict(model,newdata=manualFirstrow,n.trees=100)
output:
> predict(model,newdata=firstrow,n.trees=100)
[1] 50.31276
> predict(model,newdata=manualFirstrow,n.trees=100)
[1] 50.31276
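If new data often arrives as plain strings, the level fix above can be applied to every factor column at once. The following is a minimal sketch in base R; `align_factor_levels` is a hypothetical helper name, not part of gbm, and it assumes the training data frame is still available.

```r
# Hypothetical helper (not part of gbm): give every factor column in
# `newdata` the same levels, in the same order, as the training data.
align_factor_levels <- function(newdata, train) {
  for (col in intersect(names(newdata), names(train))) {
    if (is.factor(train[[col]])) {
      # as.character() handles both character and factor input
      newdata[[col]] <- factor(as.character(newdata[[col]]),
                               levels = levels(train[[col]]))
    }
  }
  newdata
}

data(OrchardSprays)
manual <- data.frame(decrease = 57, rowpos = 1, colpos = 1, treatment = "D")
aligned <- align_factor_levels(manual, OrchardSprays)

# The internal code of "D" now matches the code used in the training data
as.integer(aligned$treatment) == as.integer(OrchardSprays$treatment[1])
```

Values that do not occur in the training levels become NA with this approach, so it is worth checking for NAs before calling predict.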
You never specified that the GLM that you want to run should be the binomial logit. You will need to wrap that logic in your own function as well, as shown below:
data(mtcars)
library(ipred)
# user-defined model function: a GLM with binomial family and logit link
myGLM = function(formula, data) {
  glm(formula, data = data, family = binomial(link = logit))
}
# user-defined prediction function: return predicted probabilities
myPredictGLM = function(object, newdata) {
  predict(object, newdata, type = "response")
}
# run errorest with user-defined modeling and prediction functions
logitErrEst = errorest(am ~ mpg + hp, data = mtcars, model = myGLM,
                       predict = myPredictGLM, estimator = "cv",
                       est.para = control.errorest(k = 5, predictions = TRUE))
# check
summary(logitErrEst$predictions)
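To see why the prediction wrapper asks for type = "response", here is a small sketch in base R only (no ipred), fitted on the full mtcars data rather than cross-validated. The 0.5 cutoff for turning probabilities into classes is an assumption for illustration.

```r
data(mtcars)

# Same model as above: binomial family, logit link
fit <- glm(am ~ mpg + hp, data = mtcars, family = binomial(link = "logit"))

# Default predictions are on the link (log-odds) scale
p_link <- predict(fit, newdata = mtcars)
# type = "response" gives probabilities between 0 and 1
p_resp <- predict(fit, newdata = mtcars, type = "response")

# The two scales are related by the inverse logit, plogis()
all.equal(unname(plogis(p_link)), unname(p_resp))

# Assumed cutoff of 0.5 to turn probabilities into 0/1 classes
pred_class <- as.integer(p_resp > 0.5)
# In-sample misclassification rate (optimistic compared to cross-validation)
mean(pred_class != mtcars$am)
```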
Classifying cases into 21 different classes is hard, so when your response variable is an integer in 0 … 20, you probably don't want to just convert it to a factor. It's hard to give more concrete advice than that without knowing much about the distribution of the data or where the data comes from or what the data is about. You can try transformations of the response variable (such as adding 1 and taking the logarithm) or cutting it into a few discrete categories (like 0 through 10 and 11 through 20). Decisions about how to code the response variable should be made with reference to its meaning (is 11 a sensible threshold?), its distribution (try not to create categories with only a few training samples), and your model.
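The two recodings mentioned above could look roughly like this in R. The response values are made up for illustration, and the threshold between 10 and 11 is only the example from the text, not a recommendation.

```r
# Made-up integer responses in 0..20, for illustration only
y <- c(0, 1, 3, 7, 10, 11, 15, 20)

# Option 1: add 1 and take the logarithm; log1p(y) computes log(1 + y)
y_log <- log1p(y)
# expm1() undoes the transformation when moving predictions
# back to the original scale
y_back <- expm1(y_log)

# Option 2: cut into two categories, 0 through 10 and 11 through 20
# (intervals are right-closed, so -1 as lower break includes 0)
y_cat <- cut(y, breaks = c(-1, 10, 20), labels = c("0-10", "11-20"))

# Check how many training samples land in each category
table(y_cat)
```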
In any case, remember that you can clip your predictions to [0, 20]; there's no sense in predicting something outside the range of the response variable.
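Clipping takes one line in base R. A small sketch with made-up raw predictions:

```r
# Made-up raw model predictions, some outside the valid range
raw_pred <- c(-1.7, 3.2, 20.9, 25.0)

# Clip to [0, 20]: pmax() raises values below 0, pmin() lowers values above 20
clipped <- pmin(pmax(raw_pred, 0), 20)
clipped
```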