So, boosting is a learning algorithm that can generate high-accuracy predictions by using, as a subroutine, another algorithm that can efficiently generate hypotheses only slightly better (by an inverse polynomial) than random guessing.
Its main advantage is speed.
When Schapire presented it in 1990, it was a breakthrough: it showed that a polynomial-time learner generating hypotheses with error only slightly smaller than 1/2 can be transformed into a polynomial-time learner generating hypotheses with arbitrarily small error.
So, the theory to back up your question is in "The strength of weak learnability" (pdf), where he basically showed that "strong" and "weak" learning are equivalent.
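Roughly, and in standard PAC-style terms rather than Schapire's exact notation: a weak learner only needs to produce, for some fixed $\gamma > 0$, a hypothesis $h$ with

$$\Pr_{x \sim D}\bigl[h(x) \neq c(x)\bigr] \le \tfrac{1}{2} - \gamma,$$

while a strong learner must achieve error at most $\varepsilon$ for any requested $\varepsilon > 0$. The paper shows that any algorithm of the first kind can be boosted into one of the second kind.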
And perhaps the answer to the original question is, "there's no point constructing strong learners when you can construct weak ones more cheaply".
From the relatively recent papers, there's "On the equivalence of weak learnability and linear separability: new relaxations and efficient boosting algorithms" (pdf) which I don't understand but which seems related and may be of interest to more educated people :)
Suppose you are trying to minimize an objective function iteratively, and its current value is $100.0$. In the given data set there are no "irreducible errors", so you can drive the loss to $0.0$ on your training data. Now you have two ways to do it.
The first way is a "large learning rate" with few iterations. Suppose you can reduce the loss by $10.0$ in each iteration; then in $10$ iterations you can reduce the loss to $0.0$.
The second way is a "small learning rate" with more iterations. Suppose you can reduce the loss by $1.0$ in each iteration; then you need $100$ iterations to reach $0.0$ loss on your training data.
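To make the arithmetic concrete, here is a toy sketch in Python (the step sizes and iteration counts are just the made-up numbers from above, not anything tied to a real optimizer):

```python
loss = 100.0
for _ in range(10):       # first way: large steps, few iterations
    loss -= 10.0          # reduce the loss by 10.0 per iteration
print(loss)               # 0.0 after 10 iterations

loss = 100.0
for _ in range(100):      # second way: small steps, many iterations
    loss -= 1.0           # reduce the loss by 1.0 per iteration
print(loss)               # 0.0 after 100 iterations
```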
Now think about this: are the two approaches equivalent? And if not, which is better in an optimization context and in a machine learning context?
In the optimization literature, the two approaches are the same, since both converge to the optimal solution. In machine learning, however, they are not equal, because in most cases we do not drive the training loss to $0$, which would cause over-fitting.
We can think of the first approach as a "coarse grid search" and the second approach as a "fine grid search". The second approach usually works better, but it needs more computational power for the extra iterations.
To prevent over-fitting, we can do different things. The first is to restrict the number of iterations: suppose we use the first approach and limit the number of iterations to 5; at the end, the loss on the training data is $50$. (This would be very strange from the optimization point of view: it means we could further improve our solution, i.e. it has not converged, but we choose not to. In optimization we usually add explicit constraints or penalty terms to the objective function, rather than limiting the number of iterations.)
On the other hand, we can also use the second approach: if we set the learning rate to be small, say reducing the loss by $0.1$ per iteration, then even with a large number of iterations, say $500$, we still have not driven the loss to $0.0$.
This is why a small learning rate is, in a sense, equivalent to "more regularization".
Here is an example of using different learning rates on an experimental data set with xgboost. Please check the following two links to see what eta or n_iterations mean:
Parameters for Tree Booster
XGBoost Control overfitting
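Since the original plot is not reproduced here, below is a minimal sketch of the kind of comparison meant above, using the xgboost scikit-learn wrapper on synthetic data (all parameter values are arbitrary; learning_rate is the wrapper's name for eta, and n_estimators is the number of boosting iterations):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from xgboost import XGBRegressor

# Synthetic data standing in for the experimental data set mentioned above
X, y = make_regression(n_samples=2000, n_features=20, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Same number of boosting iterations, different learning rates (eta)
for eta in (0.01, 0.1, 0.5):
    model = XGBRegressor(n_estimators=50, learning_rate=eta, max_depth=3)
    model.fit(X_train, y_train)
    train_err = mean_squared_error(y_train, model.predict(X_train))
    test_err = mean_squared_error(y_test, model.predict(X_test))
    print(f"eta={eta}: train MSE={train_err:.1f}, test MSE={test_err:.1f}")
```

Comparing the train and test errors for each eta is what lets you label a run as under- or over-fitting, as described next.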
For the same number of iterations, say $50$, a small learning rate is "under-fitting" (or the model has "high bias"), and a large learning rate is "over-fitting" (or the model has "high variance").
PS. The evidence of under-fitting is that both the training and testing sets have a large error, and the error curves for training and testing are close to each other. The sign of over-fitting is that the training set's error is very low while the testing set's error is very high, and the two curves are far apart from each other.
Best Answer
Trees are often a good choice for new data mining problems with unknown structure. From The Elements of Statistical Learning section 10.7:
Sorry, long quote, but I think it explains their rationale/excuse for only addressing trees. They go on to derive a boosting algorithm that uses trees, explain what parameters have to be found to make it work, define steepest descent in the context of what I have seen elsewhere called the "function space" ($f = \{f(x_1), f(x_2), \ldots, f(x_N)\}$), explain how constraining the next classifier in the ensemble to be a tree whose outputs are similar to the steepest-descent "direction" generalizes better than just adding the steepest-descent function directly (some kind of regularization going on?), and then define an algorithm.
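In symbols (standard gradient-boosting notation, not necessarily the book's exact formulation): the steepest-descent "direction" at iteration $m$ is the negative gradient of the loss evaluated at the current model, and the next tree is chosen to approximate it:

$$g_{im} = -\left[\frac{\partial L\bigl(y_i, f(x_i)\bigr)}{\partial f(x_i)}\right]_{f = f_{m-1}}, \qquad T_m \approx \arg\min_{T} \sum_{i=1}^{N} \bigl(g_{im} - T(x_i)\bigr)^2.$$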
But they neglect to address the fact that we often do want to use boosting in the context of pure prediction. It's great that we see such an amazing performance improvement from gradient boosting with trees, but what about the other weak learners we are used to boosting being able to handle?
There is no principled reason decision trees could not be replaced with some other low-bias, high-variance learner like small neural nets. It would definitely change steps 2.b and 2.c of their Algorithm 10.3, and it might also change the initialization at step 1, but it could be done! I am not aware of any common or popular algorithms in the literature that do this, though; when I looked for them I came across your question instead. It may be that trees are so easy to train, and have such provably low risk as an ensemble, that no one expects anything else to be able to do better?
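As a hedged sketch of what that replacement could look like (squared-error loss, scikit-learn's MLPRegressor standing in for the small neural net; this illustrates the idea, not the book's Algorithm 10.3, and all parameter values are arbitrary):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.neural_network import MLPRegressor

# Toy data; the base learner below could be any low-bias, high-variance model
X, y = make_regression(n_samples=500, n_features=10, noise=5.0, random_state=0)

n_rounds, nu = 20, 0.1               # boosting iterations and shrinkage
f = np.full_like(y, y.mean())        # initialize with a constant fit
learners = []

for m in range(n_rounds):
    residuals = y - f                # negative gradient of squared-error loss
    base = MLPRegressor(hidden_layer_sizes=(8,), max_iter=500, random_state=m)
    base.fit(X, residuals)           # fit the base learner to the descent "direction"
    f = f + nu * base.predict(X)     # shrink and add to the ensemble
    learners.append(base)

print("training MSE:", np.mean((y - f) ** 2))
```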