Solved – why is XGBoost giving me seriously biased predictions with small nrounds

Tags: boosting, offset, poisson-regression

I cannot put my company data online, but I can provide a reproducible example here.

We are modelling insurance claim frequency with a Poisson distribution, using exposure as an offset.

In this example, we want to model the number of claims, Claims ($y_i$), with exposure Holders ($e_i$).

In a traditional GLM, we can directly model $y_i$ and put $e_i$ in the offset term. This option is not available in xgboost, so the alternative is to model the rate $\frac{y_i}{e_i}$ and put $e_i$ as the weight term (reference).
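
For reference, a minimal sketch of the traditional GLM with an offset, using the same MASS::Insurance data as below (claims modelled directly, exposure entering as a log offset):

library(MASS)
data(Insurance)

glm.fit <- glm(Claims ~ District + Group + Age + offset(log(Holders)),
               family = poisson(link = "log"),
               data   = Insurance)

sum(Insurance$Claims)  # observed total
sum(fitted(glm.fit))   # equals the observed total for a Poisson GLM with a log link and intercept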

When I do that with a large number of iterations, the results are coherent ($\sum y_i = \sum \hat{y_i}$), but that is not the case when nrounds = 5. I would expect the equation $\sum y_i = \sum \hat{y_i}$ to be satisfied after the very first iteration.

The following code is an extreme example for the sake of reproducibility. In my real case, I performed cross-validation on the training set (optimizing MAE) and obtained nrounds = 1200, with training MAE equal to testing MAE. I then re-ran xgboost on the whole data set with 1200 iterations and found that $\sum y_i \ne \sum \hat{y_i}$ by a large margin. This doesn't make sense to me, or am I missing something?
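
Roughly, the kind of xgb.cv call I mean is sketched below (the fold count, early-stopping setting and nrounds ceiling are placeholders, not the values from my real case; xgbMatrix is the DMatrix built in the code further down):

cv <- xgb.cv(params  = list(objective = "count:poisson", eval_metric = "mae"),
             data    = xgbMatrix,
             nrounds = 2000,
             nfold   = 5,
             early_stopping_rounds = 50,
             verbose = FALSE)
cv$best_iteration  # number of rounds selected by the CV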

So my questions are:

  1. Am I specifying the parameters for Poisson regression with an offset in xgboost correctly?
  2. Why is there such a high bias in the first iterations?
  3. Why, after tuning nrounds with xgb.cv, do I still see a high bias?

Here is a graph of the ratio $\frac{\sum \hat{y_i}}{\sum y_i}$ as a function of nrounds:

[plot: $\frac{\sum \hat{y_i}}{\sum y_i}$ versus nrounds]

Code edited following @JonnyLomond's comment:

library(MASS)
library(caret)
library(xgboost)
library(dplyr)

#-------- load data --------#
data(Insurance)

#-------- data preparation --------#

#small adjustments
Insurance$rate = with(Insurance, Claims/Holders)
temp<-dplyr::select(Insurance,District, Group, Age, rate)
temp2= dummyVars(rate ~ ., data = temp, fullRank = TRUE) %>% predict(temp)

#create xgb matrix
xgbMatrix <- xgb.DMatrix(as.matrix(temp2), 
                         label = Insurance$Claims)
setinfo(xgbMatrix, "base_margin", log(Insurance$Holders))  # exposure enters as a log-scale offset


#-------------------------------------------#
#      First model with small nrounds
#-------------------------------------------#
bst.1 = xgboost(data    = xgbMatrix, 
                objective ='count:poisson', 
                nrounds   = 5)


pred.1 = predict(bst.1, xgbMatrix)
sum(Insurance$Claims) #3151
sum(pred.1) #12650.8 fails



#-------------------------------------------#
#      Second model with more iterations
#-------------------------------------------#

bst.2 = xgboost(data    = xgbMatrix, 
                objective = 'count:poisson', 
                nrounds   = 100)


pred.2 = predict(bst.2, xgbMatrix)
sum(Insurance$Claims) #3151
sum(pred.2) #same
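
The ratio plot above was produced along these lines (a sketch, not my exact plotting code; ntreelimit restricts the prediction to the first k trees):

ratio <- sapply(1:100, function(k) {
  sum(predict(bst.2, xgbMatrix, ntreelimit = k)) / sum(Insurance$Claims)
})
plot(1:100, ratio, type = "l",
     xlab = "nrounds", ylab = "sum(pred) / sum(obs)")
abline(h = 1, lty = 2)  # reference line where predicted and observed totals match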

Best Answer

First, a few technical things:

  1. You can use an offset in xgboost for Poisson regression by setting the base_margin value in the xgb.DMatrix object.

  2. You will not get the same results from the weight-based approach (rate as label, Holders as weight, sketched below) as from the base_margin term. (The two are equivalent in a GLM, but this is not a GLM; I think the weight term in xgboost means something different.)
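
For illustration, a minimal sketch of that weight-based set-up, assuming the temp2 matrix from the question (the rate is the label and Holders the case weight; the predictions are then rates, so multiply by the exposure before comparing totals):

xgbMatrix.w <- xgb.DMatrix(as.matrix(temp2),
                           label  = Insurance$rate,      # model the rate y_i / e_i
                           weight = Insurance$Holders)   # exposure as case weight

bst.w  <- xgboost(data = xgbMatrix.w, objective = 'count:poisson', nrounds = 100)
pred.w <- predict(bst.w, xgbMatrix.w)      # predicted rates
sum(pred.w * Insurance$Holders)            # compare with sum(Insurance$Claims)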

For your question:

The sum of predictions will not equal the sum of observations after a small number of rounds, for several reasons:

  1. Xgboost regularizes the predictions in the leaf nodes, shrinking them toward 0. This happens by default.

  2. Xgboost scales the predictions from each tree by a positive number less than 1 (the learning rate eta). This happens by default (see the sketch after this list).

  3. Xgboost can randomly sample rows each round instead of fitting on the whole data set. I think this does not happen by default.

  4. Some other behaviours (for example, the count:poisson objective caps leaf updates via max_delta_step by default).
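
A minimal sketch of switching off the first two defaults on the question's xgbMatrix (eta = 1 removes the learning-rate shrinkage, lambda = 0 removes the L2 regularisation of the leaf weights); this should bring the early-round totals much closer to the observed total, although the Poisson objective's max_delta_step cap still applies:

bst.check <- xgboost(data      = xgbMatrix,
                     objective = 'count:poisson',
                     eta       = 1,   # default 0.3
                     lambda    = 0,   # default 1
                     nrounds   = 5)

sum(predict(bst.check, xgbMatrix))   # compare with sum(Insurance$Claims)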

Basically: It is not true that fitting one round in xgboost is the same as fitting a basic decision tree in the usual way.
