Boosting – XGBoost Obtain Optimal n_estimators Parameter with Early Stopping

boosting, model-evaluation

If I use early stopping with an evaluation set during training, what is the best approach when I have to train the model for the final evaluation? Generally I'd train the model on the full dataset, but in this case I can't use the early stopping feature since I have no validation set. Is there a way to obtain the proper n_estimators value from the training run with the evaluation set and then use it as a parameter? Or is it better to use, even for the final result, only the partially trained model fitted with early stopping and the evaluation set?

Thank you

Best Answer

You are correct to assume that when using early stopping, following a train-validation split of our data, we will potentially estimate the optimal number of estimators $M$ as being lower than the one that would be optimal when training on the full dataset, $M_{full}$. In a sense, that is only natural: when we utilise a larger dataset to train our algorithm, we should be able to learn a richer set of rules without necessarily over-fitting. To be clear: the number of iterations $M$ is the one we computed when using early stopping. For XGBoost, assuming we train with early stopping, it can be found under the attribute `best_iteration` of the fitted booster.
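As a minimal sketch (using synthetic data and illustrative hyper-parameters, not a recommendation), this is how one would obtain $M$ with XGBoost's native API:

```python
# Fit XGBoost with early stopping on a held-out validation set and read off
# the best iteration M. Data and parameter values here are purely illustrative.
import xgboost as xgb
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=5000, n_features=20, random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.2, random_state=0
)

dtrain = xgb.DMatrix(X_train, label=y_train)
dvalid = xgb.DMatrix(X_valid, label=y_valid)

booster = xgb.train(
    params={"objective": "reg:squarederror", "eta": 0.05, "max_depth": 4},
    dtrain=dtrain,
    num_boost_round=5000,              # generous upper bound on the rounds
    evals=[(dvalid, "validation")],
    early_stopping_rounds=50,          # stop if no improvement for 50 rounds
    verbose_eval=False,
)

M = booster.best_iteration             # optimal number of iterations found
print("Optimal number of boosting rounds M:", M)
```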

I have not come across a general rule or a research paper on how to accurately estimate the final number of iterations when training on the full training set following a CV procedure. I have come across a rough approximation: if we use $P\%$ of our data for the validation set and we get $M$ iterations as the optimal number, we can approximate the number of iterations when training with the full dataset as $M_{full} = \frac{M}{1-0.01P}$. This is for example put forward by some experienced Kaggle competitors (competition masters or grandmasters) here and here. Similarly, another experienced Kaggle competitor also suggests here multiplying $M$ by a fixed factor close to $1.1$ to get the number $M_{full}$ to be used when training the final model. From personal experience (I am not an experienced Kaggle competitor), I have found that using a slightly increased number of iterations $M$ (about 3-10% more than the one suggested by early stopping) indeed improves my leaderboard position; i.e. it helps the model trained on the full dataset to have better generalisation performance.
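Continuing the sketch above, the rescaling rule looks something like this (again only a rough heuristic, with the same illustrative parameters):

```python
# With P% of the data held out for validation, approximate
# M_full = M / (1 - 0.01 * P) and re-train on the full dataset for that many
# rounds; no evaluation set or early stopping is needed in the final fit.
import numpy as np
import xgboost as xgb

P = 20                                      # percentage of data used for validation
M = booster.best_iteration                  # from the early-stopped run above
M_full = int(np.ceil(M / (1 - 0.01 * P)))   # e.g. M = 400 -> M_full = 500

dfull = xgb.DMatrix(X, label=y)             # the full dataset, train + validation
final_booster = xgb.train(
    params={"objective": "reg:squarederror", "eta": 0.05, "max_depth": 4},
    dtrain=dfull,
    num_boost_round=M_full,                 # fixed number of rounds
)
```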

Note that if we use a cross-validation scheme instead of a fixed validation set, each fold might have a different optimal number of iterations $M$. In that case, we need to be careful not to over-simplify things. It is worth checking that the number of optimal iterations per fold is "ball-park the same", e.g. within 10% of the mean of $M$ across all folds; otherwise the estimates are probably too variable to reasonably average. In that case it would be prudent to make the per-fold performance more stable before continuing (e.g. by stratifying our response variable and/or by increasing the regularisation parameters used).
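A minimal sketch of that per-fold check, assuming the same data and parameters as above and the 10%-of-the-mean threshold mentioned in the previous paragraph:

```python
# Run early stopping inside each CV fold, collect the per-fold optimal
# iteration counts, and only average them if they are "ball-park the same"
# (here: every fold within 10% of the mean).
import numpy as np
import xgboost as xgb
from sklearn.model_selection import KFold

params = {"objective": "reg:squarederror", "eta": 0.05, "max_depth": 4}
best_iters = []

for train_idx, valid_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    dtr = xgb.DMatrix(X[train_idx], label=y[train_idx])
    dva = xgb.DMatrix(X[valid_idx], label=y[valid_idx])
    bst = xgb.train(
        params, dtr, num_boost_round=5000,
        evals=[(dva, "validation")],
        early_stopping_rounds=50, verbose_eval=False,
    )
    best_iters.append(bst.best_iteration)

best_iters = np.array(best_iters)
mean_M = best_iters.mean()
if np.all(np.abs(best_iters - mean_M) <= 0.10 * mean_M):
    print("Per-fold estimates are stable; average M:", int(round(mean_M)))
else:
    print("Per-fold estimates vary too much:", best_iters)
    # consider stratifying the response and/or increasing regularisation
```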

The above being said, even if we decide not to scale the "early-stopping" optimal number of iterations, we should still re-train our model using the full dataset. This has been covered multiple times on CV.SE; see for example the related threads there for more details.