The issue is the number of trees you are fitting in relation to your learning rate.
You do not provide a learning rate to your booster (called shrinkage in the R library), so the model assumes the default of $0.001$. This means you are fitting 50 trees, and the contribution of each is shrunk by a factor of $0.001$, so you're really only getting about $0.05$ of a tree.
You need to fit many, many more trees when using boosting.
    library(gbm)
    data(mtcars)

    M <- gbm(mpg ~ cyl + disp + hp + wt + qsec,
             data = mtcars,
             distribution = "gaussian",
             interaction.depth = 3,
             bag.fraction = 0.7,
             n.trees = 10000)

    # predict on the training data using all 10000 trees
    p <- predict(M, newdata = mtcars, n.trees = 10000)
    summary(p)
This results in

       Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
      13.24   15.19   18.97   20.09   25.93   26.86
To tune the appropriate number of trees, you should fit many more than you think are necessary, then use cross-validation to assess the optimal number to use for predictions. Here's an example you can run using your data.
    M <- gbm(mpg ~ cyl + disp + hp + wt + qsec,
             data = mtcars,
             distribution = "gaussian",
             interaction.depth = 2,
             n.minobsinnode = 2,
             bag.fraction = 1.0,
             n.trees = 50000,
             cv.folds = 3)

    gbm.perf(M)
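When called with `method = "cv"`, `gbm.perf` plots the cross-validated error curve and returns the estimated optimal number of trees, which you can then pass to `predict`. A minimal sketch (the exact optimum will vary with the random CV folds):

    # Estimate the optimal number of trees from the CV error curve
    best.iter <- gbm.perf(M, method = "cv")

    # Use only that many trees when predicting
    p <- predict(M, newdata = mtcars, n.trees = best.iter)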
Given the very small size of your data set, it would be a good idea to bootstrap this entire cross-validation process many, many times to assess the stability of your decisions.
I am going to contradict @Pieter's answer and say that your problem is that you have too much bias and too little variance. In other words, your network is not complex enough for this task.
To see this, let $Y$ be the true and correct output that your network should return (the target), and let $\hat{Y}$ be the output that your network actually returns. Your loss-function is the mean squared-error, averaged over all examples in your training dataset $\mathcal{D}$:
$$
\mathbb{E}_{\mathcal{D}}\left[(Y - \hat{Y})^2\right]
$$
In this loss-function, using your network, we are trying to adjust the probability distribution of $\hat{Y}$ such that it matches the probability distribution of $Y$. In other words, we are trying to make $Y=\hat{Y}$, such that the mean squared-error is $0$. This is the lowest possible value of the mean squared-error:
$$
\mathbb{E}_{\mathcal{D}}\left[(Y - \hat{Y})^2\right] \geq 0
$$
However, from the question How can I prove mathematically that the mean of a distribution is the measure that minimizes the variance?, we know that for a prediction that is constant across examples, the mean squared-error actually has a tighter lower-bound, attained when $\hat{Y} = \mathbb{E}_{\mathcal{D}}[Y]$, at which point the mean squared-error loss-function becomes
$$
\mathbb{E}_{\mathcal{D}}\left[(Y - \mathbb{E}_{\mathcal{D}}[Y])^2\right] = \text{Var}(Y)
$$
Since we know that the variance of $Y$ is non-negative, the mean squared-error loss-function of such a constant prediction has the following lower-bounds
$$
\mathbb{E}_{\mathcal{D}}\left[(Y - \hat{Y})^2\right] \geq \text{Var}(Y) \geq 0
$$
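To make this concrete, here is a small numerical sketch in R (with simulated data; the numbers are illustrative) showing that a constant prediction equal to the sample mean attains a mean squared-error equal to the variance of $Y$:

    set.seed(1)
    y <- rnorm(1000, mean = 5, sd = 2)   # simulated targets

    y.hat <- mean(y)          # a constant prediction: the sample mean
    mean((y - y.hat)^2)       # MSE of always predicting the mean...
    mean((y - mean(y))^2)     # ...equals Var(Y) (population form)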
In your case, you have reached the lower-bound $\text{Var}(Y)$, since you observe that $\hat{Y} = \mathbb{E}_{\mathcal{D}}[Y]$. This means that the bias of $\hat{Y}$ (strictly speaking, this is not the correct definition of bias, but it gets the point across) is
$$
(Y - \hat{Y})^2 = (Y - \mathbb{E}_{\mathcal{D}}[Y])^2
$$
The variance of $\hat{Y}$ is
$$
\mathbb{E}_{\mathcal{D}}\left[\left(\hat{Y} - \mathbb{E}_{\mathcal{D}}\left[\hat{Y}\right]\right)^2\right] = \mathbb{E}_{\mathcal{D}}\left[\left(\mathbb{E}_{\mathcal{D}}[Y] - \mathbb{E}_{\mathcal{D}}[\mathbb{E}_{\mathcal{D}}[Y]]\right)^2\right] = 0
$$
Clearly, you have too much bias and too little variance.
So, how do we get past $\text{Var}(Y)$ and reach the lower bound of $0$? We need to increase the variance of $\hat{Y}$ by either adding more parameters to the network or adjusting the network architecture. As discussed in What should I do when my neural network doesn't learn? (a highly recommended read), add many more parameters or adjust the network architecture, and then over-fit and test your network on a single example.
If the network no longer predicts the mean on a single example, then you can scale up slowly and start over-fitting and testing the network on two examples, then three examples, and so on. Otherwise, you need to keep adding more parameters/adjusting the network architecture until your network no longer predicts the mean.
Eventually, once you reach a dataset size of around 100 examples, you can start to split your data into training and testing to evaluate the generalization performance of your network. At this point, if it starts to predict the mean again, then make sure that the examples that you are adding to the dataset are similar to the examples that you already worked through in the smaller datasets. In other words, they are normalized and "look" similar. Also, keep in mind that as you add more data to the dataset, you will more likely need to add more parameters for better generalization performance.
Another modification that helps in practice, though not as much as what I described above, is to slightly adjust the mean squared-error loss function itself. If your mean squared-error loss function is
$$
\mathcal{L}(y,\hat{y}) = \frac{1}{N} \sum_{i=1}^N (y_i-\hat{y}_i)^2
$$
where $N$ is the dataset size, then consider using the following loss function instead:
$$
\mathcal{L}(y,\hat{y}) = \left[\frac{1}{N} \sum_{i=1}^N (y_i-\hat{y}_i)^2\right] + \alpha \cdot \left[\frac{1}{N} \sum_{i=1}^N (\log(y_i)-\log(\hat{y}_i))^2\right]
$$
where $\alpha$ is a hyperparameter that can be tuned via trial and error; a starting value could be $\alpha=5$. The advantage of this loss function over the plain mean squared-error loss function is that the $\log(\cdot)$ function stretches small values in the interval $[0,1]$ away from each other, which means that small differences between $y$ and $\hat{y}$ are amplified, leading to larger gradients. I have personally found this modified loss function to be very helpful in practice.
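For example (illustrative numbers), the same absolute error of $0.01$ produces a much larger difference on the log scale near $0$ than near $1$:

    log(0.02) - log(0.01)   # ~0.693
    log(0.91) - log(0.90)   # ~0.011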
For this to work well, it is recommended (but not necessary) that $y$ and $\hat{y}$ are each scaled to have values in the interval $[0,1]$. Also, since $\log(0)=-\infty$ and it is likely that $y$ and $\hat{y}$ will have values very close to $0$, it is recommended to add a small value $\epsilon$, such as $\epsilon=10^{-9}$, to $y$ and $\hat{y}$ inside the loss function, as follows:
$$
\mathcal{L}(y,\hat{y}) = \left[\frac{1}{N} \sum_{i=1}^N (y_i-\hat{y}_i)^2\right] + \alpha \cdot \left[\frac{1}{N} \sum_{i=1}^N (\log(y_i + \epsilon)-\log(\hat{y}_i + \epsilon))^2\right]
$$
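As a concrete reference, here is a minimal R sketch of this loss; the function name and example vectors are illustrative, not from the original question:

    # Mean squared-error plus a log-scaled penalty, as defined above.
    # Assumes y and y.hat are numeric vectors scaled to [0, 1].
    log.scaled.mse <- function(y, y.hat, alpha = 5, eps = 1e-9) {
      mse     <- mean((y - y.hat)^2)
      log.mse <- mean((log(y + eps) - log(y.hat + eps))^2)
      mse + alpha * log.mse
    }

    y     <- c(0.01, 0.20, 0.90)   # illustrative targets
    y.hat <- c(0.02, 0.25, 0.85)   # illustrative predictions
    log.scaled.mse(y, y.hat)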
This loss function may be thought of as the Mean Squared Log-scaled Error Loss.
Best Answer
In the comment you ask for an example. You can find it here (the link points to the most informative comment, but please read the entire thread for clarity).
In the above example, the most intriguing part for me is the value of `-666`. It is the score on the 2nd tree (the one with variable V2). Note that this score falls outside of the assumed distribution of $Y$, i.e. $[2000, 20000]$. I understand this could be because the `-666` from the example above does not come from averaging, as in a simple regression tree / random forest, but from the fact that the entire prediction comes from an aggregation (a chain-like summation) of results from different sub-trees. The summation involves weights $w$ that are assigned to each tree, and the weights themselves come from
$$
w_j^\ast = -\frac{G_j}{H_j+\lambda}
$$
where $G_j$ and $H_j$ are within-leaf sums of the first- and second-order derivatives of the loss function; they therefore do not depend on the lower or upper boundaries of $Y$.
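As a minimal sketch, assume the squared-error loss $\frac{1}{2}(y-\hat{y})^2$, for which the first-order derivative is $\hat{y}-y$ and the second-order derivative is $1$; the leaf weight can then be computed as follows in R (the function name and inputs are purely illustrative):

    # Optimal leaf weight w* = -G / (H + lambda), where G and H sum the
    # first- and second-order derivatives over the examples in the leaf.
    leaf.weight <- function(y, y.prev, lambda = 1) {
      G <- sum(y.prev - y)   # first-order derivatives: y.hat - y
      H <- length(y)         # second-order derivatives: all equal to 1
      -G / (H + lambda)
    }

    # Nothing in the formula constrains w* to the observed range of y;
    # e.g. here w* = 4, below min(y) = 5, because lambda shrinks it.
    leaf.weight(y = c(5, 7), y.prev = c(0, 0))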
Please note that the linked example does not prove this is mathematically or empirically possible, because the values in the example are arbitrarily selected and do not come from an actual model.
The formulas above come from the XGBoost website / paper.