Solved – Are XGBoost probability outputs based on the number of examples in a terminal leaf

boosting, cart, classification, data mining, machine learning

I am trying to replace a C4.5 tree that someone else implemented with a boosted tree (XGBoost).
The data is extremely skewed and the company wants the new model to output similar distributions.

C4.5 trees determine probabilities from the class counts of the training observations that end in a terminal leaf, and I was wondering whether the same is true of XGBoost.

Best Answer

Are XGBoost probability outputs based on the number of examples in a terminal leaf?

No. XGBoost is a gradient boosted tree ensemble, so it estimates a weight vector $c \in \mathbb{R}^M$ that assigns a weight to each of the $M$ leaves. A sample's prediction (on the logit scale) is the sum of the weights of the leaves it falls into, one per tree. In the binary case, applying the logistic (inverse logit) function to this score yields the predicted probability.
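You can verify this relationship directly with the standard `xgboost` Python package: the probability output equals the logistic transform of the raw margin (the summed leaf weights plus the base score). A minimal sketch, assuming an illustrative skewed dataset rather than your actual data:

```python
# Sketch: check that XGBoost's probability output is the logistic
# transform of the summed leaf weights, not a leaf-count frequency.
import numpy as np
import xgboost as xgb
from sklearn.datasets import make_classification

# Illustrative imbalanced binary data (90% class 0), standing in for the
# skewed data described in the question.
X, y = make_classification(n_samples=500, weights=[0.9], random_state=0)
dtrain = xgb.DMatrix(X, label=y)
model = xgb.train({"objective": "binary:logistic"}, dtrain, num_boost_round=20)

# Raw logit score: the sum of each sample's leaf weights (plus base score).
margin = model.predict(dtrain, output_margin=True)
# Default output: the predicted probability.
proba = model.predict(dtrain)

# The probability is the logistic (inverse logit) of the margin.
assert np.allclose(proba, 1.0 / (1.0 + np.exp(-margin)))
```

The `output_margin=True` flag returns the untransformed sum of leaf weights, which is why the assertion holds exactly.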

The XGBoost paper has a helpful description of how it works.

Tianqi Chen and Carlos Guestrin, "XGBoost: A Scalable Tree Boosting System," Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD 2016).

See also: "In XGboost are weights estimated for each sample and then averaged"