So, boosting is a learning algorithm that can generate high-accuracy predictions by using, as a subroutine, another algorithm that can efficiently generate hypotheses only slightly better (by an inverse polynomial) than random guessing.
Its main advantage is speed.
When Schapire presented it in 1990, it was a breakthrough: it showed that a polynomial-time learner generating hypotheses with error only slightly smaller than 1/2 can be transformed into a polynomial-time learner generating hypotheses with arbitrarily small error.
So, the theory to back up your question is in "The strength of weak learnability" (pdf), where he basically showed that "strong" and "weak" learning are equivalent.
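Roughly, and in standard PAC-style terms rather than Schapire's exact notation: a weak learner only needs to produce, for some fixed $\gamma > 0$, a hypothesis $h$ with

$$\Pr_{x \sim D}\bigl[h(x) \neq c(x)\bigr] \le \tfrac{1}{2} - \gamma,$$

while a strong learner must achieve error at most $\varepsilon$ for any requested $\varepsilon > 0$. The paper shows that any algorithm of the first kind can be boosted into one of the second kind.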
And perhaps the answer to the original question is, "there's no point constructing strong learners when you can construct weak ones more cheaply".
From the relatively recent papers, there's "On the equivalence of weak learnability and linear separability: new relaxations and efficient boosting algorithms" (pdf) which I don't understand but which seems related and may be of interest to more educated people :)
Suppose you are trying to minimize an objective function iteratively, and its current value is $100.0$. In the given data set there are no "irreducible errors", so you can drive the loss to $0.0$ on your training data. Now you have two ways to do it.
The first way is a "large learning rate" with few iterations. Suppose you can reduce the loss by $10.0$ in each iteration; then in $10$ iterations you can reduce the loss to $0.0$.
The second way is a "small learning rate" with more iterations. Suppose you can reduce the loss by $1.0$ in each iteration; then you need $100$ iterations to reach $0.0$ loss on your training data.
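To make the arithmetic concrete, here is a toy sketch in Python (the step sizes and iteration counts are just the made-up numbers from above, not anything tied to a real optimizer):

```python
loss = 100.0
for _ in range(10):       # first way: large steps, few iterations
    loss -= 10.0          # reduce the loss by 10.0 per iteration
print(loss)               # 0.0 after 10 iterations

loss = 100.0
for _ in range(100):      # second way: small steps, many iterations
    loss -= 1.0           # reduce the loss by 1.0 per iteration
print(loss)               # 0.0 after 100 iterations
```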
Now think about this: are the two approaches equivalent? And if not, which is better in an optimization context and in a machine learning context?
In the optimization literature, the two approaches are the same, since both converge to the optimal solution. In machine learning, however, they are not equal, because in most cases we do not drive the training loss to $0$, which would cause over-fitting.
We can think of the first approach as a "coarse grid search" and the second approach as a "fine grid search". The second approach usually works better, but it needs more computational power for the extra iterations.
To prevent over-fitting, we can do different things. The first is to restrict the number of iterations: suppose we use the first approach and limit the number of iterations to 5; at the end, the loss on the training data is $50$. (This would be very strange from the optimization point of view: it means we could further improve our solution, i.e. it has not converged, but we choose not to. In optimization we usually add explicit constraints or penalty terms to the objective function, rather than limiting the number of iterations.)
On the other hand, we can also use the second approach: if we set the learning rate to be small, say reducing the loss by $0.1$ per iteration, then even with a large number of iterations, say $500$, we still have not driven the loss to $0.0$.
This is why a small learning rate is, in a sense, equivalent to "more regularization".
Here is an example of using different learning rates on an experimental data set with xgboost. Please check the following two links to see what eta or n_iterations mean:
Parameters for Tree Booster
XGBoost Control overfitting
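Since the original plot is not reproduced here, below is a minimal sketch of the kind of comparison meant above, using the xgboost scikit-learn wrapper on synthetic data (all parameter values are arbitrary; learning_rate is the wrapper's name for eta, and n_estimators is the number of boosting iterations):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from xgboost import XGBRegressor

# Synthetic data standing in for the experimental data set mentioned above
X, y = make_regression(n_samples=2000, n_features=20, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Same number of boosting iterations, different learning rates (eta)
for eta in (0.01, 0.1, 0.5):
    model = XGBRegressor(n_estimators=50, learning_rate=eta, max_depth=3)
    model.fit(X_train, y_train)
    train_err = mean_squared_error(y_train, model.predict(X_train))
    test_err = mean_squared_error(y_test, model.predict(X_test))
    print(f"eta={eta}: train MSE={train_err:.1f}, test MSE={test_err:.1f}")
```

Comparing the train and test errors for each eta is what lets you label a run as under- or over-fitting, as described next.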
For the same number of iterations, say $50$, a small learning rate is "under-fitting" (or the model has "high bias"), and a large learning rate is "over-fitting" (or the model has "high variance").
PS. The evidence of under-fitting is that both the training and testing sets have a large error, and the error curves for training and testing are close to each other. The sign of over-fitting is that the training set's error is very low while the testing set's error is very high, and the two curves are far apart from each other.
Best Answer
Trees are often a good choice for new data mining problems with unknown structure. From The Elements of Statistical Learning section 10.7:
Sorry, long quote, but I think it explains their rationale/excuse for only addressing trees. They go on to derive a boosting algorithm that uses trees, explain what parameters have to be found to make it work, define steepest descent in the context of what I have seen elsewhere called the "function space" ($f = \{f(x_1), f(x_2), \ldots, f(x_N)\}$), explain how constraining the next classifier in the ensemble to be a tree whose outputs are similar to the steepest-descent "direction" generalizes better than just adding the steepest-descent function directly (some kind of regularization going on?), and then define an algorithm.
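In symbols (standard gradient-boosting notation, not necessarily the book's exact formulation): the steepest-descent "direction" at iteration $m$ is the negative gradient of the loss evaluated at the current model, and the next tree is chosen to approximate it:

$$g_{im} = -\left[\frac{\partial L\bigl(y_i, f(x_i)\bigr)}{\partial f(x_i)}\right]_{f = f_{m-1}}, \qquad T_m \approx \arg\min_{T} \sum_{i=1}^{N} \bigl(g_{im} - T(x_i)\bigr)^2.$$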
But they neglect to address the fact that we often do want to use boosting in the context of pure prediction. It's great that we see such an amazing performance improvement from gradient boosting with trees, but what about the other weak learners we are used to boosting being able to handle?
There is no principled reason decision trees could not be replaced with some other low-bias, high-variance learner like small neural nets. It would definitely change steps 2.b and 2.c of their Algorithm 10.3, and it might also change the initialization at step 1, but it could be done! I am not aware of any common or popular algorithms in the literature that do this, though; when I looked for them I came across your question instead. It may be that trees are so easy to train, and have such provably low risk as an ensemble, that no one expects anything else to be able to do better?
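As a hedged sketch of what that replacement could look like (squared-error loss, scikit-learn's MLPRegressor standing in for the small neural net; this illustrates the idea, not the book's Algorithm 10.3, and all parameter values are arbitrary):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.neural_network import MLPRegressor

# Toy data; the base learner below could be any low-bias, high-variance model
X, y = make_regression(n_samples=500, n_features=10, noise=5.0, random_state=0)

n_rounds, nu = 20, 0.1               # boosting iterations and shrinkage
f = np.full_like(y, y.mean())        # initialize with a constant fit
learners = []

for m in range(n_rounds):
    residuals = y - f                # negative gradient of squared-error loss
    base = MLPRegressor(hidden_layer_sizes=(8,), max_iter=500, random_state=m)
    base.fit(X, residuals)           # fit the base learner to the descent "direction"
    f = f + nu * base.predict(X)     # shrink and add to the ensemble
    learners.append(base)

print("training MSE:", np.mean((y - f) ** 2))
```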