Solved – Additive bias in xgboost (and its correction?)

Tags: bias correction, boosting, machine learning, overfitting

I am taking part in a competition right now. I know it is my job to do that well, but maybe somebody wants to discuss my problem and its solution here, as this could be helpful for others in their field too.

I have trained an xgboost model (a tree-based model, a linear one, and an ensemble of the two). As discussed already here, the mean absolute error (MAE) on the training set (where I did cross-validation) was small (approx. 0.3), while on the held-out test set the error was around 2.4.
Then the competition started and the error was around 8 (!), and surprisingly the forecast was always approx. 8-9 above the true value! See the region circled in yellow in the picture:

[Figure: forecast vs. true values over time; the biased region is circled in yellow]

I have to say that the training data ended in Oct '15, while the competition started just now (April '16, with a test period of approx. two weeks in March).

Today I simply subtracted a constant value of 9 from my forecast, the error went down to 2, and I reached number 3 on the leaderboard (for this one day). 😉
This is the part to the right of the yellow line.
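
For anyone who wants to apply the same correction programmatically: a minimal sketch, where 'pred' and 'actual' are hypothetical vectors holding the raw forecasts and the realised values of the last few days:

    # Estimate the additive bias as the mean residual on recent observations
    # ('pred' and 'actual' are placeholders for your own vectors)
    bias <- mean(pred - actual)          # comes out around 9 in my case
    pred_corrected <- pred - bias        # shift the whole forecast down
    mean(abs(pred_corrected - actual))   # MAE after the correction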

So what I would like to discuss:

  • How does xgboost react to adding an intercept term to the model equation? Could this lead to bias if the system changes too much (as it did in my case from Oct '15 to April '16)? (See the sketch after this list.)
  • Could an xgboost model without an intercept be more robust to parallel shifts in the target value?
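
For concreteness: the closest thing xgboost has to an intercept is the global bias parameter base_score (default 0.5), which every prediction starts from before the trees add their residual corrections. A minimal sketch of setting it explicitly via the plain xgboost interface (parameter values are illustrative; training, model.data and inTrain are the objects from my caret set-up below):

    library(xgboost)

    # base_score acts as a fixed global bias; the trees then model the
    # deviations from it. If the target level shifts after training, this
    # "intercept" stays anchored at the old level.
    bst <- xgboost(
      data       = data.matrix(training),           # same features as below
      label      = model.data$Price[inTrain],
      nrounds    = 100,                             # illustration only
      max_depth  = 3,
      eta        = 0.01,
      base_score = mean(model.data$Price[inTrain])  # intercept at the old level
    )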

I will keep subtracting my bias of 9, and if anybody is interested I can show you the result. It would just be more interesting to get more insight here.

Best Answer

I will answer myself and let you know my findings in case anybody is interested.

First, the bias: I took the time to collect all the recent data, format it correctly, and so on. I should have done this long before. The picture is the following:

[Figure: price series from the end of 2015 vs. April '16, at a clearly different level]

You see the data from the end of 2015 and then April '16. The price level is totally different. A model trained on 2015 data cannot possibly capture this change.
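
In hindsight, one thing that might have exposed the drift earlier is validating on time-ordered splits instead of random folds. A minimal sketch with caret's rolling-origin resampling (the window sizes are made up and depend on the data frequency):

    library(caret)

    # Rolling-origin ("time slice") resampling: train on a window of past
    # observations and validate on the block that immediately follows it.
    ctrl_ts <- trainControl(
      method        = "timeslice",
      initialWindow = 300,  # observations per training window (hypothetical)
      horizon       = 30,   # observations per validation block (hypothetical)
      fixedWindow   = TRUE  # slide the window forward instead of growing it
    )
    # then pass ctrl_ts as trControl to train() in the set-up below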

Second, the fit of xgboost. I really liked the following set-up; train and test error are much closer now and still good:

    library(caret)    # provides train()
    library(xgboost)  # backend for method = "xgbTree"

    # single-point grid: many boosting rounds, shallow trees, slow learning rate
    xgb_grid_1 <- expand.grid(
      nrounds          = 12000,
      eta              = 0.01,
      max_depth        = 3,
      gamma            = 1,
      colsample_bytree = 0.7,
      min_child_weight = 5,
      subsample        = 0.8  # newer caret versions tune this in the grid
    )

    # 'training', 'model.data', 'inTrain' and 'ctrl' come from my earlier
    # data preparation and trainControl() set-up
    xgb_train_1 <- train(
      x         = training,
      y         = model.data$Price[inTrain],
      trControl = ctrl,
      tuneGrid  = xgb_grid_1,
      method    = "xgbTree"
    )

Thus I use a lot of trees, each of which is at most 3 splits deep (as recommended here). With this the calculation stays quick (the number of leaves doubles with each extra level of depth, so a depth-3 tree has at most 2³ = 8 leaves) and the overfitting seems to be reduced.
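
To put numbers on "much closer", a minimal sketch of how the two errors can be compared; 'testing' and 'test_y' are hypothetical names for the held-out features and prices:

    # compare the fit on the training data against a held-out set
    mae <- function(pred, obs) mean(abs(pred - obs))

    mae(predict(xgb_train_1, newdata = training), model.data$Price[inTrain])
    mae(predict(xgb_train_1, newdata = testing),  test_y)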

My summary: use a lot of trees with a small number of leaves each, and make sure to train on recent data. For the competition this was bad luck for me...