The easiest way to handle 'tuning' of the num_rounds parameter is to let XGBoost do it for you. Set the early_stopping_rounds parameter to n in the train method, and the model will stop training once the validation error hasn't decreased for n rounds.
See this example from the Liberty Mutual Kaggle Competition:
As noted in the code below, you'll also need to supply the watchlist parameter to enable early stopping.
# You can write R code here and then click "Run" to run it on our platform
# The readr library is the best way to read and write CSV files in R
library(readr)
library(xgboost)
library(data.table)
library(Matrix)
library(caret)
# The competition datafiles are in the directory ../input
# Read competition data files:
train <- read_csv("../input/train.csv")
test <- read_csv("../input/test.csv")
# Generate output files with write_csv(), plot() or ggplot()
# Any files you write to the current directory get shown as outputs
# keep copy of ID variables for test and train data
train_Id <- train$Id
test_Id <- test$Id
# response variable from training data
train_y <- train$Hazard
# predictor variables from training
train_x <- subset(train, select = -c(Id, Hazard))
train_x <- sparse.model.matrix(~., data = train_x)
# predictor variables from test
test_x <- subset(test, select = -c(Id))
test_x <- sparse.model.matrix(~., data = test_x)
# Set xgboost parameters
param <- list("objective" = "reg:linear",
"eta" = 0.05,
"min_child_weight" = 10,
"subsample" = .8,
"colsample_bytree" = .8,
"scale_pos_weight" = 1.0,
"max_depth" = 5)
# Using 5000 rows for early stopping.
offset <- 5000
num_rounds <- 1000
# Set xgboost test and training and validation datasets
xgtest <- xgb.DMatrix(data = test_x)
xgtrain <- xgb.DMatrix(data = train_x[(offset + 1):nrow(train_x),], label = train_y[(offset + 1):nrow(train_x)])
xgval <- xgb.DMatrix(data = train_x[1:offset,], label= train_y[1:offset])
# setup watchlist to enable train and validation, validation must be first for early stopping
watchlist <- list(val=xgval, train=xgtrain)
# to train with watchlist, use xgb.train, which contains more advanced features
# this will use default evaluation metric = rmse which we want to minimise
bst1 <- xgb.train(params = param, data = xgtrain, nrounds = num_rounds, print_every_n = 20, watchlist = watchlist, early_stopping_rounds = 50, maximize = FALSE)
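Once training has stopped early, it is the iteration with the best validation score that you want to use at prediction time. A minimal sketch of that last step, continuing from the script above; the best_iteration field and the iterationrange argument assume a recent xgboost R release, so check your version's documentation:

```r
# Predict on the test matrix using only the trees up to the best iteration
# found by early stopping (field/argument names are version-dependent).
best_n <- bst1$best_iteration                         # iteration with lowest val-rmse
pred <- predict(bst1, xgtest, iterationrange = c(1, best_n))
# Write a submission file in the usual Id/Hazard format
submission <- data.frame(Id = test_Id, Hazard = pred)
write_csv(submission, "xgboost_submission.csv")
```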
Is overfitting so bad that you should never pick a model that overfits, even though its test error is smaller? No, but you should have a justification for choosing it.
This behavior is not restricted to XGBoost. It is a common thread among all machine learning techniques: finding the right tradeoff between underfitting and overfitting. The formal framing is the bias-variance tradeoff (Wikipedia).
The bias-variance tradeoff
The following is a simplification of the Bias-variance tradeoff, to help justify the choice of your model.
We say that a model has high bias if it is not able to fully use the information in the data. It relies too much on general information, such as the most frequent case, the mean of the response, or a few powerful features. Bias can come from wrong assumptions, for example assuming that the variables are normally distributed or that the model is linear.
We say that a model has high variance if it uses too much information from the data. It relies on information that is relevant only in the particular training set presented to it, and so it does not generalize well enough. Typically, the model changes a lot if you change the training set, hence the name "high variance".
These definitions are very similar to the definitions of underfitting and overfitting. However, they are often simplified into opposites, as in:
- The model is underfitting if both the training and test error are high. This means that the model is too simple.
- The model is overfitting if the test error is higher than the training error. This means that the model is too complex.
These simplifications are of course helpful, as they help you choose the right complexity for the model. But they overlook an important point: (almost) every model has both a bias and a variance component. The underfitting/overfitting description tells you that you have too much bias or too much variance, but you (almost) always have both.
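This is easiest to see in the standard decomposition of the expected squared error at a point $x$, where the bias and variance terms are both always present:

$$
\mathbb{E}\left[(y - \hat{f}(x))^2\right]
= \underbrace{\left(f(x) - \mathbb{E}[\hat{f}(x)]\right)^2}_{\text{bias}^2}
+ \underbrace{\mathbb{E}\left[\left(\hat{f}(x) - \mathbb{E}[\hat{f}(x)]\right)^2\right]}_{\text{variance}}
+ \underbrace{\sigma^2}_{\text{noise}}
$$

An underfitting model is dominated by the first term and an overfitting model by the second, but neither term is ever exactly zero.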
If you want more information about the bias-variance tradeoff, there are a lot of helpful visualizations and good resources available through Google. Every machine learning textbook will have a section on it; here are a few:
- An Introduction to Statistical Learning and The Elements of Statistical Learning (available here).
- Pattern Recognition and Machine Learning, by Christopher Bishop.
- Machine Learning: A Probabilistic Perspective, by Kevin Murphy.
Also, a nice blog post that helped me grasp it is Scott Fortmann-Roe's Understanding the Bias-Variance Tradeoff.
Application to your problem
So you have two models,
$$
\begin{array}{lrrl}
& \text{Train MAE} & \text{Test MAE} &\\
\text{MARS} & \sim4.0 & \sim4.0 & \text{Low variance, higher bias},\\
\text{XGBoost} & \sim0.3 & \sim2.4 & \text{Higher variance, lower bias},\\
\end{array}
$$
and you need to pick one. To do so, you need to define what a better model is. The parameters that should enter your decision are the complexity and the performance of the model.
- How many "units" of complexity are you willing to exchange for one "unit" of performance?
- More complexity is associated with higher variance. If you want your model to generalize well on a dataset that is a little bit different than the one you have trained on, you should aim for less complexity.
- If you want a model that you can understand easily, you can do so at the cost of performance by reducing the complexity of the model.
- If you are aiming for the best performance on a dataset that you know comes from the same generative process as your training set, you can manipulate complexity in order to optimize your test error and use this as a metric. This happens when your training set is randomly sampled from a larger set and your model will be applied on that larger set. This is the case in most Kaggle competitions, for example.
The goal here is not to find a model that "does not overfit". It is to find the model that has the best bias-variance tradeoff. In this case, I would argue that the reduction in bias accomplished by the XGBoost model is good enough to justify the increase in variance.
What you can do
However, you can probably do better by tuning the hyperparameters.
Increasing the number of rounds and reducing the learning rate is a possibility. Something that is "weird" about gradient boosting is that running it well past the point where the training error has hit zero seems to still improve the test error (as discussed here: Is Deeper Better Only When Shallow Is Good?). You can try training your model a little bit longer once you have set the other parameters.
The depth of the trees you grow is a very good place to start. Note that for every one unit of depth, you double the number of leaves to be constructed. If you were to grow trees of depth two instead of depth 16, it would take $1/2^{14}$ of the time! You should try growing more, smaller trees. The reason is that the depth of the tree should represent the degree of feature interaction. This may be jargon, but if your features have a degree of interaction of 3 (roughly: a combination of 4 features is not more powerful than a combination of 3 of those features plus the fourth), then growing trees deeper than 3 is detrimental. Two trees of depth three will have more generalization power than one tree of depth four. This is a rather complicated concept and I will not go into it right now, but you can check this collection of papers for a start. Also, note that deep trees lead to high variance!
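The cost claim above is easy to check: a complete binary tree of depth d has $2^d$ leaves, so each extra level of depth doubles the work. A quick base-R sanity check:

```r
# Number of leaves in a complete binary tree of each depth
depths <- c(2, 3, 4, 16)
leaves <- 2^depths
# Depth 16 builds 2^14 = 16384 times as many leaves as depth 2
ratio <- leaves[depths == 16] / leaves[depths == 2]
stopifnot(ratio == 2^14)
print(ratio)  # 16384
```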
Subsampling, which is closely related to bagging, is great for reducing variance. If your individual trees have high variance, averaging over them produces a prediction with less variance than any individual tree. If, after tuning the depth of your trees, you still encounter high variance, try more aggressive subsampling (that is, reduce the fraction of data used per tree). Subsampling the feature space also achieves this goal.
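The variance-reduction effect of averaging is easy to simulate. The sketch below uses pure random noise as a stand-in for the predictions of individual high-variance trees; the numbers are purely illustrative:

```r
# Each column stands in for one subsampled tree's predictions on 10000 points.
set.seed(1)
B <- 50                                   # number of "trees" being averaged
preds <- matrix(rnorm(10000 * B), ncol = B)
var_single <- var(preds[, 1])             # variance of a single tree, ~1
var_bagged <- var(rowMeans(preds))        # variance of the average, ~1/B
stopifnot(var_bagged < var_single)        # averaging shrinks variance
```

For independent trees the variance of the average falls like 1/B; boosted trees are correlated, so the reduction in practice is smaller but still real.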
Best Answer
I will answer myself and let you know my findings in case anybody is interested.
First, the bias: I took the time to collect all the recent data and format it correctly and so on. I should have done this long before. The picture is the following:
You see the data from the end of 2015 and then April 2016. The price level is totally different. A model trained on 2015 data can in no way capture this change.
Second, the fit of xgboost: I really liked the following set-up. Train and test error are much closer now and still good:
Thus I use a lot of trees, all of them at most 3 splits deep (as recommended here). This keeps the calculation quick (the number of leaves doubles with each additional split level) and the overfitting seems to be reduced.
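For reference, a sketch of the kind of parameter set this describes; the exact values here are hypothetical, not the ones actually used in my runs:

```r
# Many shallow trees with a small learning rate; early stopping trims the excess.
param_shallow <- list(objective = "reg:linear",   # as in the earlier example
                      eta = 0.01,                 # low learning rate
                      max_depth = 3,              # at most 3 splits deep
                      subsample = 0.8,
                      colsample_bytree = 0.8)
num_rounds_shallow <- 5000                        # a lot of trees
```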
My summary: use a lot of trees with a small number of leaves each, and look for recent data. For the competition this was bad luck for me...