Solved – Understand Adaboost feature selection


I would like to understand the mechanism by which a particular variable is selected as the basis for a decision stump in AdaBoost. Does this happen randomly? If so, a given variable may well appear more than once as the basis variable for a decision stump.

Also, with AdaBoost, we assign higher weights to the instances misclassified by one weak learner, but then use those higher weights to train a completely different decision stump. This doesn't make sense to me, since the new weak learner is likely to misclassify different instances.

Finally, what happens if we have 20 variables but 40 iterations in AdaBoost? Since a decision tree is deterministic, will we end up with 2 identical trees for each variable in the AdaBoost model?

Thanks in advance

Best Answer

"Feature selection", as you describe it, boils down to fitting a depth 1 decision tree on a set of data. Given such a set and weights associated with each, the problem of fitting a decision tree (a stump) involves finding the "best" variable $x$ and threshold $s$, where the best variable and split threshold are defined as the pair that minimizes some measure of node impurity, like the Gini index. If you are unfamiliar with the process of fitting a decision tree, there are lots of resources around the internet. Here is one example.

So, given a set of candidate variables to split on and a set of training data, there will be a unique solution (a single variable and threshold) that is the best depth-1 decision tree for the current boosting stage. In that sense, there is no randomness. However, the set of variables $X=\{x_1, x_2, \dots \}$ from which we pick the split variable may be either the entire set of features we have, or a (random) subset. Many implementations of decision tree classifiers allow the fitting algorithm to randomly pick a subset of variables at each split. For example, in sklearn's DecisionTreeClassifier, this is governed by the max_features parameter.
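As a small, hedged illustration of that option (again with synthetic data), a stump restricted to a random subset of variables at the split could look like this:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = (X[:, 0] + X[:, 7] > 0).astype(int)

# With max_features set, only a random subset of the variables is considered
# at the split; the best variable/threshold pair is then chosen from that subset.
random_stump = DecisionTreeClassifier(max_depth=1, max_features="sqrt", random_state=0)
random_stump.fit(X, y)
print("variable chosen from the random subset:", random_stump.tree_.feature[0])
```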

It may indeed be the case that a variable appears in more than one stump during the AdaBoost procedure; nothing precludes that from happening.

The reason we weight the misclassified points more heavily is that those are the ones we want to correct. A particular variable may not be the best one to split on given equally weighted data, but it may become the best variable to split on once the weights become unevenly distributed, because it is good at correcting the mistakes of a previous learner.
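To see how the reweighting feeds into the next stump, here is a minimal sketch of the standard discrete AdaBoost weight update (a toy loop on synthetic data, not a production implementation):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = np.where(X[:, 0] + 0.5 * X[:, 3] > 0, 1, -1)       # labels in {-1, +1}

n, T = len(y), 5
w = np.full(n, 1.0 / n)                                 # start with uniform weights

for t in range(T):
    stump = DecisionTreeClassifier(max_depth=1)
    stump.fit(X, y, sample_weight=w)                    # the new stump sees the current weights
    pred = stump.predict(X)
    err = np.sum(w[pred != y]) / np.sum(w)              # weighted error of this stump
    alpha = 0.5 * np.log((1 - err) / max(err, 1e-12))   # this stump's vote in the ensemble
    w = w * np.exp(alpha * (pred != y))                 # up-weight the misclassified points
    w = w / w.sum()                                     # renormalize
    print(f"round {t}: split on variable {stump.tree_.feature[0]}, weighted error {err:.3f}")
```

Because the weights change after each round, the variable and threshold that minimize the weighted impurity can change from one round to the next, which is exactly why a different stump can be selected.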

To answer your final question, it isn't guaranteed that each of your 20 features will appear exactly twice in the final result. It could be that several features are repeated, and potentially some are ignored altogether. We can make even fewer guarantees if the subset of features we can split on at each step is a random subset of the full 20 features.
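As an illustration of that point, here is a short sketch (sklearn, synthetic data) that runs 40 boosting rounds over 20 features and counts how often each feature is chosen as the root split; typically a few features repeat and some never appear at all. Note that the base-learner argument is named estimator in recent sklearn releases (base_estimator in older ones):

```python
from collections import Counter

from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, n_informative=5, random_state=0)

ada = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=1),  # use base_estimator= on older sklearn
    n_estimators=40,                                # 40 boosting rounds, 20 candidate features
    random_state=0,
)
ada.fit(X, y)

# Which variable did each of the 40 stumps split on?
chosen = [stump.tree_.feature[0] for stump in ada.estimators_]
print(Counter(chosen))
```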

Finally, I'll point out that AdaBoost is a meta-estimator that can use any kind of weak learner as the base classifier. The decision tree is a popular choice, but not necessarily the only one. You could use logistic regression (less common) or even neural networks.
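For instance, a hedged sketch with logistic regression as the base learner could look like this; any base estimator that accepts sample weights can be plugged in:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Logistic regression supports sample_weight, so it can serve as the weak learner.
clf = AdaBoostClassifier(
    estimator=LogisticRegression(max_iter=1000),  # use base_estimator= on older sklearn
    n_estimators=10,
    random_state=0,
)
clf.fit(X, y)
print("training accuracy:", clf.score(X, y))
```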
