Boosting Algorithms – How Does the `subsample` Parameter Work in XGBoost and LightGBM?

boosting, lightgbm, machine-learning, sampling

From what I know, both are sequential learners: only the first tree in the sequence is built directly on the data, and every following tree is built to correct the mistakes of the previous trees, hence improving performance and decreasing bias.

The subsample parameter in XGBoost and LightGBM dictates the fraction of rows used to build each tree.

So, with this context, if subsample is set to 0.75, the first tree gets built with 75% of the data and all the following trees focus on correcting mistakes. What happens to the remaining 25% of the data? Will another set of sequential trees be built in parallel? Or am I missing something here, or have I got something wrong?
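
For reference, here is a minimal sketch of how I am setting the parameter with the scikit-learn wrappers (as far as I understand, LightGBM additionally needs subsample_freq > 0 for row subsampling to actually kick in):

```python
import numpy as np
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier

# toy data
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))
y = (X[:, 0] + rng.normal(scale=0.5, size=1000) > 0).astype(int)

# XGBoost: each tree is built on a fresh 75% row sample
xgb_clf = XGBClassifier(n_estimators=100, subsample=0.75)
xgb_clf.fit(X, y)

# LightGBM: subsample (alias bagging_fraction) takes effect only when
# subsample_freq (alias bagging_freq) is > 0
lgbm_clf = LGBMClassifier(n_estimators=100, subsample=0.75, subsample_freq=1)
lgbm_clf.fit(X, y)
```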

Best Answer

That 25% of the data is unused by that learner (i.e. that iteration) of the XGBoost model (assuming subsample=0.75). This is normal: what the subsample argument implements is bagging, by subsampling once in every boosting iteration. This means that, as you described, a portion of the data is not used by that specific base-learner during the $i$-th iteration. In the $(i+1)$-th iteration, sub-sampling of the whole dataset is performed once again.
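
To make the mechanism concrete, here is a minimal sketch in plain numpy/scikit-learn (not the actual XGBoost internals): every boosting iteration draws a fresh 75% subsample, fits the next tree only on those rows, and then updates the running ensemble prediction for every row, including the 25% that tree never saw.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 5))
y = X[:, 0] ** 2 + rng.normal(scale=0.1, size=1000)

n_rounds, subsample, learning_rate = 50, 0.75, 0.1
pred = np.zeros_like(y)  # current ensemble prediction for *all* rows

for _ in range(n_rounds):
    # draw a new subsample for this boosting iteration
    idx = rng.choice(len(y), size=int(subsample * len(y)), replace=False)

    # fit this tree to the residuals (the negative gradient for squared loss)
    # of the subsampled rows only
    residual = y[idx] - pred[idx]
    tree = DecisionTreeRegressor(max_depth=3).fit(X[idx], residual)

    # the ensemble prediction is still updated for every row,
    # including the 25% that this particular tree never saw
    pred += learning_rate * tree.predict(X)
```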

Through bagging (or, more formally, bootstrap aggregation) we effectively bootstrap our estimator and obtain a more robust overall result. Think of it as estimating the sample mean via bootstrapping; the "mean" here is just the "expected prediction for each item" in our sample instead of a single "expected prediction for the sample's central tendency".

And note that, irrespective of the subsampling proportion we use (e.g. 10%), we can always provide estimates for all (i.e. 100%) of our sample items. For any given tree, some will be out-of-sample, of course, and some in-sample. In that way, we can also estimate error gradients for all our sample items if needed.
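
As an illustration (a sketch using XGBoost's native API on a made-up regression dataset), even with subsample=0.1 we can score every row of the training data and compute a per-row loss gradient:

```python
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = X[:, 0] + rng.normal(scale=0.3, size=500)

dtrain = xgb.DMatrix(X, label=y)
booster = xgb.train(
    {"objective": "reg:squarederror", "subsample": 0.1},  # only 10% of rows per tree
    dtrain,
    num_boost_round=50,
)

# Predictions are available for all 500 rows, regardless of which rows each
# individual tree happened to see.
pred = booster.predict(dtrain)

# For squared error, the gradient of 0.5 * (pred - y)^2 w.r.t. pred is simply
# (pred - y), and it can be evaluated for every row.
grad = pred - y
print(grad.shape)  # (500,)
```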