Solved – Is it possible for a gradient boosting regression to predict values outside of the range seen in its training data

Tags: boosting, loss-functions, prediction, random-forest

I am using http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingRegressor.html to fit a gradient boosting model (GBM) built on regression trees. I am using quantile loss with alpha=0.5, i.e. my loss function is mean absolute error (MAE). The optimal model with this loss function is the conditional median, $\text{median}[Y \;|\;X]$, where $Y$ denotes the predicted variable, and $X$ is a vector of covariates.

Very rarely, I have seen predictions that are outside of the range seen in the model's training data. For example, the $Y$ in my training data might lie in $[500, 20000]$, and (very rarely) I see predictions with $\hat{Y} < 500$. Is this theoretically possible with GBMs, or should I suspect that there is a bug in my code and/or in sklearn?

Assuming I understand random forests (RFs) correctly, it should be impossible for this to happen with an RF, because the predicted values are all means / medians (depending on whether one uses squared error or absolute error loss) of subsets of the training data. But GBMs are different from RFs, and this argument does not carry over. Are predictions outside the range of the training data theoretically possible with GBMs?
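For reference, here is a minimal, self-contained sketch of the kind of check I have in mind. The data and hyperparameters below are synthetic placeholders rather than my real setup, so whether the GBM actually leaves the training range will depend on the data and settings:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor

rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(2000, 5))
# synthetic target kept roughly inside [500, 20000], mimicking the question
y = np.clip(5000 + 3000 * X[:, 0] + 1500 * np.sin(3 * X[:, 1])
            + 500 * rng.standard_normal(2000), 500, 20000)

# GBM with quantile loss, alpha=0.5 -> conditional median, as in the question
gbm = GradientBoostingRegressor(loss="quantile", alpha=0.5, n_estimators=500,
                                learning_rate=0.1, max_depth=3,
                                random_state=0).fit(X, y)
# random forest for comparison (leaf values are means of training targets)
rf = RandomForestRegressor(n_estimators=300, random_state=0).fit(X, y)

# query points, including regions not covered by the training data
X_new = rng.uniform(-4, 4, size=(10000, 5))
for name, model in [("GBM", gbm), ("RF ", rf)]:
    pred = model.predict(X_new)
    outside = (pred < y.min()) | (pred > y.max())
    print(name, "prediction range:", pred.min(), pred.max(),
          "| any prediction outside training range:", outside.any())
```

By construction the random forest averages leaf values that are themselves means of training targets, so its check should never report a prediction outside the training range; the question is whether the GBM's can.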

Best Answer

In the comments you ask for an example. You can find one here (the link points to the most informative comment, but please read the entire thread for context).

In the above example, the most intriguing part for me is the value of -666. It is the score of the 2nd tree (the one using variable V2). Note that this score falls outside the assumed range of $Y$, i.e. $[2000, 20000]$.

I understand this could be because the -666 in the example above does not come from averaging, as it would in a single regression tree or a random forest, but from the fact that the entire prediction is an aggregation (a chain-like summation) of the results of the individual trees. The summation involves weights $w_j$ assigned to each leaf of each tree, and the weights themselves come from:

$w_j^\ast = -\frac{G_j}{H_j+\lambda}$

where $G_j$ and $H_j$ are the within-leaf sums of the first- and second-order derivatives of the loss function; they therefore do not depend on the lower or upper boundaries of $Y$.
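As a purely hypothetical, numeric illustration of that formula: for squared-error loss the per-observation gradients are $g_i = \hat{y}_i - y_i$ and the Hessians are $h_i = 1$, so $w_j^\ast$ is just a shrunken mean of the residuals in leaf $j$. The numbers below are made up and only meant to show that nothing in the computation refers to the range of $Y$:

```python
import numpy as np

lam = 1.0                                    # regularisation term lambda (made up)
y_leaf = np.array([520.0, 610.0, 700.0])     # true targets falling into one leaf (made up)
y_hat = np.array([1400.0, 1350.0, 1500.0])   # current ensemble predictions for them (made up)

g = y_hat - y_leaf       # first-order derivatives of 0.5 * (y - y_hat)^2 w.r.t. y_hat
h = np.ones_like(g)      # second-order derivatives (constant 1 for squared error)

G, H = g.sum(), h.sum()
w = -G / (H + lam)       # the leaf weight w_j* = -G_j / (H_j + lambda)
print("leaf weight w_j* =", w)   # -605 here: a correction driven by residuals,
                                 # not bounded by the observed range of Y
```

Because every boosting round adds such a correction (scaled by the learning rate) to the running prediction, the accumulated sum is not constrained to stay inside the training range of $Y$.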

Please note that the linked example does not prove this is mathematically or empirically possible, because the values in the example are arbitrarily selected and do not come from an actual model.

The formulas above come from the XGBoost documentation and paper.
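As a closing illustration of that chain-like summation, here is a hedged sklearn sketch (synthetic data; the XGBoost leaf-weight formula above does not apply verbatim to sklearn's implementation, but the additive structure is the same). staged_predict exposes the running sum, and none of the per-tree increments is clipped to the training range of $Y$:

```python
# Decompose a fitted sklearn GBM prediction into its additive pieces:
#   F(x) = F_0(x) + sum_m learning_rate * tree_m(x)
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.RandomState(1)
X = rng.uniform(-3, 3, size=(500, 3))
y = 5000 + 3000 * X[:, 0] + 500 * rng.standard_normal(500)

gbm = GradientBoostingRegressor(loss="quantile", alpha=0.5, n_estimators=100,
                                learning_rate=0.1, random_state=0).fit(X, y)

x0 = X[:1]                                                  # a single query point
f0 = gbm.init_.predict(x0)[0]                               # initial constant prediction F_0
stages = np.array([p[0] for p in gbm.staged_predict(x0)])   # F_1(x0), F_2(x0), ...
increments = np.diff(np.concatenate(([f0], stages)))        # per-tree contributions

print("initial prediction F_0: ", f0)
print("final prediction:       ", stages[-1])
print("F_0 + sum of increments:", f0 + increments.sum())    # matches the final prediction
# Each increment is learning_rate times a leaf value fitted to pseudo-residuals;
# nothing constrains the running sum to stay inside [y.min(), y.max()].
```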