Boosting – Why Test Error Doesn't Increase with a High Number of Boosting Iterations: Figure 10.13 Analysis from 'The Elements of Statistical Learning'

boosting, overfitting

My question refers to Figure 10.13 of The Elements of Statistical Learning. There, the test error decreases monotonically as the number of boosting iterations increases. However, I don't understand why the test error does not rise for a high number of iterations. In the figure below, the test error increases for the more complex model.

[Figure: test error rising again as model complexity increases]

EDIT:
The boosting algorithm combines the output of weak learners (often trees) to produce strong predictions.

$$
f_{M}(x) = \sum_{m=1}^{M}f_{m}(x;\theta_{m})
$$

where $M$ is a tunable hyperparameter that controls the trade-off between bias and variance.
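
To make the role of $M$ concrete, here is a minimal sketch (mine, not the book's) using scikit-learn, where `n_estimators` plays the role of $M$ and, assuming `GradientBoostingRegressor` with squared-error loss, the fitted model's prediction equals the initial fit plus the shrunken sum of the $M$ tree contributions:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic data just for illustration
X, y = make_regression(n_samples=500, n_features=10, noise=5.0, random_state=0)

# M (n_estimators) is the number of weak learners in the additive expansion
M = 100
model = GradientBoostingRegressor(
    n_estimators=M, learning_rate=0.1, max_depth=3, random_state=0
).fit(X, y)

# Rebuild the prediction by hand: initial fit + learning_rate * sum of the M trees,
# mirroring f_M(x) = sum_m f_m(x; theta_m)
pred_manual = model.init_.predict(X).ravel()
for tree in model.estimators_.ravel():
    pred_manual += model.learning_rate * tree.predict(X)

print(np.allclose(pred_manual, model.predict(X)))  # should print True for squared-error loss
```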

In Section 10.14.1 of The Elements of Statistical Learning, a gradient boosting model is fit to the California Housing data. In Figure 10.13, the average absolute error on the training and test data decreases monotonically with the number of boosting iterations (M). The authors write:

The test error is seen to decrease monotonically with increasing M,
more rapidly during the early stages and then leveling off to being nearly constant as iterations increase. Thus, the choice of a particular value of M is not critical, as long as it is not too small

So I think the authors claim that the test error will level off even for a large number of iterations M. However, shouldn't the test error rise for a very complex model (i.e., a very large M)?

Best Answer

Note the rest of the paragraph you quoted:

This tends to be the case in many applications. The shrinkage strategy (10.41) tends to eliminate the problem of overfitting, especially for larger data sets.

Shrinkage slows down overfitting, but it does still happen. Using essentially the same setup as in the example, here's what I get with sklearn set to many more trees (for whatever reason, the leveling-off happens for me a bit later than in ESL):
[Figure: training and test error curves over 10,000 trees]
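
For reference, a minimal sketch of this kind of setup (the depth, learning rate, loss, and train/test split here are my own choices, not necessarily the notebook's):

```python
import numpy as np
from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

X, y = fetch_california_housing(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = GradientBoostingRegressor(
    n_estimators=10_000,      # many more trees than in ESL's Figure 10.13 (slow; shrink for a quick run)
    learning_rate=0.1,        # shrinkage (ESL eq. 10.41)
    max_depth=4,
    loss="absolute_error",    # called "lad" in older scikit-learn versions
    random_state=0,
).fit(X_train, y_train)

# Test MAE after each boosting iteration, via staged predictions
test_mae = [mean_absolute_error(y_test, pred) for pred in model.staged_predict(X_test)]
print(min(test_mae), test_mae[-1])  # does the curve come back up at the end?
```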

I thought the weak learners themselves might also play a big part: eventually they don't have enough capacity to fit the gradient well, so reaching the actual training minimum isn't easy. I tried depth-1 trees on a classification problem built around an XOR-style data-generating process (which depth-1 trees cannot capture), and the result suggests I was wrong about that:
[Figure: test loss increases substantially with further iterations]
I suppose the XOR part actually makes the true process very hard to find, so the trees eventually end up fitting random patterns in the noise instead?
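
A minimal sketch of the stump experiment (the exact data-generating process and settings here are my own assumptions, not necessarily the notebook's):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(4000, 10))
# XOR-style signal on the first two features, with 10% label noise
y = (X[:, 0] > 0) ^ (X[:, 1] > 0)
flip = rng.random(len(y)) < 0.10
y = (y ^ flip).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)

# Depth-1 trees (stumps) cannot represent the XOR interaction
model = GradientBoostingClassifier(
    n_estimators=5000, learning_rate=0.1, max_depth=1, random_state=0
).fit(X_train, y_train)

# Test log-loss after each boosting iteration: with stumps and an XOR signal,
# later trees can mostly pick up noise, so the test loss eventually climbs
test_loss = [log_loss(y_test, p) for p in model.staged_predict_proba(X_test)]
print(np.argmin(test_loss), test_loss[-1] - min(test_loss))
```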

Colab notebook
