Solved – Poisson xgboost with exposure

boosting, caret, offset, poisson-regression, r

I am trying to model a count response variable with uneven exposure. A classical GLM would use log(exposure) as an offset, and gbm does the same, but as far as I can tell xgboost does not support an offset so far.

Looking for a workaround, I found this example on Cross Validated (Where does the offset go in Poisson/negative binomial regression?), which suggested modelling the frequency (a real number) instead of the counts, weighting by exposure.
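For reference, the equivalence behind that suggestion can be checked with an ordinary Poisson GLM on the same Insurance data. This is only a sketch: the model formula and the names fit_offset and fit_freq are illustrative choices, not taken from the linked answer. Both fits solve the same score equations, so their coefficients should agree up to convergence tolerance.

```r
library(MASS)   # provides the Insurance data set
data(Insurance)

# Offset form: model the counts, with log(exposure) entering at a fixed coefficient of 1.
fit_offset <- glm(Claims ~ District + Group + Age + offset(log(Holders)),
                  family = poisson, data = Insurance)

# Weighted-frequency form: model Claims/Holders and weight each row by its exposure.
# R warns about non-integer "counts" here, so the warnings are suppressed.
fit_freq <- suppressWarnings(
  glm(Claims / Holders ~ District + Group + Age,
      family = poisson, weights = Holders, data = Insurance)
)

# Compare the two sets of coefficients.
print(all.equal(coef(fit_offset), coef(fit_freq), tolerance = 1e-6))
```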

I tried to adapt some xgboost code to apply the same method to my data, but I failed. Below is the code I wrote:

library(MASS)
data(Insurance)
library(xgboost)
options(contrasts=c("contr.treatment","contr.treatment")) # use treatment contrasts for both unordered and ordered factors

Insurance$freq<-with(Insurance, Claims/Holders )
library(caret)

temp<-dplyr::select(Insurance,District, Group, Age,freq)
# nested call instead of %>%, which none of the packages attached above provide
temp2 <- predict(dummyVars(freq ~ ., data = temp, fullRank = TRUE), newdata = temp)

xgbMatrix <- xgb.DMatrix(as.matrix(temp2), 
                     label = Insurance$freq, 
                     weight = Insurance$Holders)

bst = xgboost(data=xgbMatrix, label = Insurance$freq, objective='count:poisson', nrounds=5)
# Warning: In xgb.get.DMatrix(data, label) : xgboost: label will be ignored.
# (strange warning)

Insurance$predFreq<-predict(bst, xgbMatrix)

with(Insurance, sum(Claims)) #3151
with(Insurance, sum(predFreq*Holders)) #7127 fails

Can anybody help? Also, I was wondering whether it would be possible to run all of this using caret's train.

Best Answer

According to the answer at https://stackoverflow.com/questions/34896004/xgboost-offset-exposure, xgboost can handle an offset term, as glm and gbm do, by using setinfo, but this method is not well documented.

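In your example, the code would be: setinfo(xgbMatrix,"base_margin",log(Insurance$Holders))

Note that with an offset the label should be the raw count (Insurance$Claims) rather than the frequency, and the exposure weights are no longer needed.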
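A minimal end-to-end sketch of the offset approach on the Insurance data might look like the following. Several choices here are assumptions rather than part of the original answer: the predictors are one-hot encoded with base R's model.matrix instead of caret, the model is fitted with xgb.train, and the hyperparameters (eta, max_depth, nrounds) are arbitrary. The label is the claim count, and log(Holders) is attached as base_margin.

```r
library(MASS)      # Insurance data
library(xgboost)

data(Insurance)

# Treat the ordered factors as plain factors so model.matrix yields 0/1 dummies.
ins <- transform(Insurance,
                 Group = factor(Group, ordered = FALSE),
                 Age   = factor(Age,   ordered = FALSE))

# Dummy-encode the predictors and drop the intercept column.
X <- model.matrix(~ District + Group + Age, data = ins)[, -1]

# Label = raw claim counts; no weights are needed when an offset is used.
dtrain <- xgb.DMatrix(data = X, label = ins$Claims)

# Attach log(exposure) as the offset (base_margin is on the log scale for count:poisson).
setinfo(dtrain, "base_margin", log(ins$Holders))

# Hyperparameters are illustrative only, not tuned.
params <- list(objective = "count:poisson", eta = 0.1, max_depth = 2, nthread = 1)
bst <- xgb.train(params = params, data = dtrain, nrounds = 50, verbose = 0)

# Predicting on a DMatrix that carries base_margin returns expected counts,
# so divide by exposure to get the predicted frequency.
pred_counts <- predict(bst, dtrain)
pred_freq   <- pred_counts / ins$Holders

# Compare total observed and total predicted claims.
print(c(observed = sum(ins$Claims), predicted = sum(pred_counts)))
```

When scoring new data, the same base_margin would need to be set on the new DMatrix; without it the predictions do not include the exposure.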
