Boosting – How Does XGBoost Select Which Feature to Split On?

boostingcartrandom forest

Pretty straightforward question. Does it check every feature and pick the one that maximizes some split-quality metric (like a decision tree), or does it just select a feature at random (like a random forest)?

Best Answer

I believe that the xgboost paper is a little underprecise on this point but I will share my interpretation of what they describe. This understanding is not based on reading the code so take it with a grain of salt and hope for a braver soul to put a finer point on this anwer (which I found because I also had it after reading the paper).

The lead-in to the discussion of the approximate algorithm for choosing a split point says

. . . a split finding algorithm enumerates over all the possible splits on all the features"

and then goes on to say that this exhaustive search is unfeasible for large datasets and also does not interact well with distributed training so that an approximate algorithm is needed. This algorithm is described in some detail in the appendix and it is (I believe) specifically an algorithm for finding candidate split points for a given feature.

Their description of how this approximate algorithm is used says

To summarize, the algorithm first proposes candidate splitting points according to percentiles of feature distribution (a specific criteria will be given in Sec. 3.3). The algorithm then maps the continuous features into buckets split by these candidate points, aggregates the statistics and finds the best solution among proposals based on the aggregated statistics.

I believe that this means that candidate splits are proposed for every single feature in the dataset and that all of these candidate splits are inspected to find the optimal one. This means that the method of finding splits really is an approximate form of the algorithm that inspects every single feature and every single split to find the best one but that the approximation only happens at the level of the splits proposed for each feature. Every single feature is still inspected.