Machine Learning – Can Decision Stumps Have More Than 2 Leaves in CART and Ensemble Learning?

adaboost, cart, ensemble learning, machine learning

I understand what a decision stump is: a shallow, 1-level decision tree that is often used as the base learner in ensemble methods such as AdaBoost. What is not immediately clear to me is how such ensemble methods are used in a multi-class classification task.

Can a decision stump have more than 2 leaves? Or does each decision stump pick only 2 of the classes and fit a model on those?

I want to understand how a decision stump works in a multi-class classification task. Which of these illustrations (my drawing) describes how it works (for a 3-class problem, for instance)?

[Illustration: the asker's sketches of candidate decision stumps for a 3-class problem]

Best Answer

You used the [cart] tag; CART trees always use binary splits. However, the Quinlan family of decision tree algorithms supports multi-way splits for categorical features, and such a stump could be used as the base learner for AdaBoost.

However, that doesn't say anything about multiclass targets. All trees support multiclass problems, by using impurity measures that take multiple classes into account. Each node contains a subset of your training set, and we record the number of samples in each class. The final class prediction is still just the majority class in a leaf, so a CART stump will predict at most 2 distinct classes (but across the AdaBoost ensemble, probably all of the classes will eventually show up). For example, here is a stump for the iris dataset (using just two features, so that the split isn't too clean to be instructive):
[Figure: decision stump for the iris data; the two leaves have value = (45, 6, 1) and (5, 44, 49), respectively.]
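For concreteness, here is a minimal sketch of how such a stump might be fit with scikit-learn; the choice of the first two features and `random_state=0` are my assumptions, so the exact split and counts may differ slightly from the figure.

```python
# Minimal sketch (scikit-learn assumed): fit a depth-1 CART stump on two iris
# features and look at the per-class counts in each leaf.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
X = X[:, :2]  # assumption: use only the first two features

stump = DecisionTreeClassifier(max_depth=1, random_state=0).fit(X, y)
print(export_text(stump))  # the single split and the majority class in each leaf

# Count the training samples of each class that land in each leaf.
leaf_ids = stump.apply(X)
for leaf in np.unique(leaf_ids):
    print(f"leaf {leaf}: class counts = {np.bincount(y[leaf_ids == leaf], minlength=3)}")
```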
The left leaf contains a large majority from the first class, so a row with first feature <= 5.45 would be classified as the first class. The right leaf contains a nearly even split between the second and third classes, but with the third class slightly more prevalent, so predictions there will be the third class. That leaves the second class without a leaf predicting it, so all of its examples are misclassified, and AdaBoost will increase its focus on them in subsequent trees.
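To illustrate that last point, here is a hedged sketch (again assuming scikit-learn; the `estimator` argument is called `base_estimator` in older versions) showing that, while any single stump predicts at most two classes, the boosted ensemble of stumps typically ends up predicting all three:

```python
# Minimal sketch: a single CART stump predicts at most two classes, but an
# AdaBoost ensemble of stumps can predict all three iris classes.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X = X[:, :2]  # same two-feature setup as above (an assumption)

ada = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=1),  # "base_estimator" in older scikit-learn
    n_estimators=50,
    random_state=0,
).fit(X, y)

print(np.unique(ada.predict(X)))  # typically [0 1 2]: the ensemble covers every class
```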

The trees can also produce "soft" predictions, namely the proportion of samples of each class in the leaf; this is done in random forests, but not in AdaBoost to my knowledge, since AdaBoost's reweighting relies on which examples were misclassified. (In the example above, the right leaf would predict probabilities of 5/98, 44/98, and 49/98.)
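A quick sketch of what those soft predictions look like for a stump (scikit-learn's `predict_proba` assumed; the query points below are made up for illustration):

```python
# Minimal sketch: a stump's "soft" prediction for a sample is the class
# proportion in whichever leaf the sample falls into.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
stump = DecisionTreeClassifier(max_depth=1, random_state=0).fit(X[:, :2], y)

# Two hypothetical query points (first two iris features only); each row of the
# output is the per-class fraction of training samples in the reached leaf.
print(stump.predict_proba([[5.0, 3.5], [6.5, 3.0]]))
```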

All that said, you could instead produce multiple binary models in either a one-vs-one or one-vs-rest fashion rather than relying on the trees' internal multiclass handling.
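If you do want that route, a minimal one-vs-rest sketch (scikit-learn assumed) could look like this:

```python
# Minimal sketch: wrap a binary AdaBoost-of-stumps model in a one-vs-rest
# scheme instead of using the trees' built-in multiclass handling.
from sklearn.datasets import load_iris
from sklearn.ensemble import AdaBoostClassifier
from sklearn.multiclass import OneVsRestClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

ovr = OneVsRestClassifier(
    AdaBoostClassifier(
        estimator=DecisionTreeClassifier(max_depth=1),  # "base_estimator" in older scikit-learn
        n_estimators=50,
        random_state=0,
    )
).fit(X, y)

print(ovr.score(X, y))  # one boosted binary model per class, combined by OvR
```

`sklearn.multiclass.OneVsOneClassifier` can be swapped in the same way for the one-vs-one variant.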