Solved – Using instance weights in XGBoost

boosting, classification, machine-learning, weighted-data

I want to understand whether assigning weights to instances in XGBoost using the method below makes sense. I switched to this method after trying a few approaches that didn't fare well (e.g. weighting class instances by event rate, similar to scale_pos_weight). I am using this method only for binary classification:

  • Run an XGBoost model without any instance weights.
  • Get the predicted probability of the event for each instance.
  • Calculate the distance between that probability and 0.5. The idea is to give more weight to instances whose probability is close to the threshold, 0.5, than to those farther away. The weights are calculated as follows:

    if P(event) <= 0.5: weight = P(event) * 2

    else: weight = (1 - P(event)) * 2

  • Run the next iterations, using the probabilities from the previous run to generate the weights.
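In code, the weighting rule above amounts to the following (a minimal Python sketch; the function name is mine):

```python
def instance_weight(p_event):
    """Weight for one instance given its predicted P(event).

    Peaks at 1.0 when p_event == 0.5 (maximally uncertain) and
    falls linearly to 0.0 as the prediction approaches 0 or 1.
    """
    if p_event <= 0.5:
        return p_event * 2
    return (1 - p_event) * 2
```

Equivalently, weight = 2 * min(p, 1 - p) = 1 - 2 * |p - 0.5|, which makes the "focus on the borderline" intent explicit.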

The reason for using the weights this way was to try to keep the model away from instances with very high or very low probability. My interpretation is that the model is fairly confident about those instances and should therefore focus elsewhere.

I have run a few experiments (on the Adult dataset) and found that adding weights in this manner does improve the model fit by a few points.

I want to understand how generalizable this approach is, and what potential issues I could face using it. I am not very familiar with XGBoost internals and need help understanding the implications of the approach.

Best Answer

The reason for using the weights this way was to try to keep the model away from instances with very high or very low probability. My interpretation is that the model is fairly confident about those instances and should therefore focus elsewhere.

This is similar to AdaBoost, which gives higher weights (during fitting, updating the weights between trees) to misclassified data. Gradient boosting is similar in spirit to AdaBoost but different in approach, and GBMs in general push probability scores towards 0 and 1.

The biggest difference then is that you give up on data where the model is confident but wrong, whereas the unweighted model will put even more focus there. ["Give up on" is perhaps not completely fair. The loss function will be high on these points, but their weights will be small. I guess it depends on the exact tradeoff whether the model will care more about them than pushing points away from 0.5 predictions.]
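A quick worked example of that tradeoff (plain Python; the numbers are illustrative): a confidently wrong point carries a large log-loss, but the proposed scheme multiplies it by a near-zero weight.

```python
import math

def instance_weight(p):
    # The question's rule, written compactly as 2 * min(p, 1 - p).
    return 2 * min(p, 1 - p)

def log_loss(y, p):
    # Per-instance binary cross-entropy.
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

# Confidently wrong: true label 0, predicted P(event) = 0.95.
loss = log_loss(0, 0.95)        # ~3.0: unweighted boosting keeps pushing here
weight = instance_weight(0.95)  # ~0.1: the weighting scheme all but ignores it
print(loss, weight, loss * weight)
```

The weighted contribution (~0.3) is an order of magnitude smaller than the unweighted loss, which is the "giving up" effect described above.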

I have run a few experiments (on the adult dataset) and have found that adding weights in this manner does improve the model fit by a few points.

Just to check: you are reporting an improvement on separate test data, right?

I want to understand how generalizable this approach is, and what potential issues I could face using it. I am not very familiar with XGBoost internals and need help understanding the implications of the approach.

The first thing that comes to mind is that this is prone to overfitting, especially if you repeat the process many times. Second, given my comment about giving up on badly misclassified points, I would guess that this method is good at ignoring outliers in otherwise cleanly separated data, but bad on messy, overlapping data. At any rate, you can detect that easily by comparing the performance of the unweighted and final weighted models on a validation set.
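One way to run that check, sketched here with scikit-learn's GradientBoostingClassifier standing in for XGBoost (both accept a per-instance sample_weight in fit; with xgboost you would pass the same array to XGBClassifier.fit), on synthetic data:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.25, random_state=0)

# Round 1: unweighted fit.
base = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)
p = base.predict_proba(X_tr)[:, 1]

# Round 2: reweight toward borderline points (weight = 2 * min(p, 1 - p)).
w = 2 * np.minimum(p, 1 - p)
weighted = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr, sample_weight=w)

# Compare on held-out data, not on the training set.
acc_base = accuracy_score(y_val, base.predict(X_val))
acc_weighted = accuracy_score(y_val, weighted.predict(X_val))
print(f"unweighted: {acc_base:.3f}  weighted: {acc_weighted:.3f}")
```

Which model wins will depend on how overlapping the classes are, which is exactly the diagnostic suggested above.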

Anyway, the proof is in the pudding, so if you continue using the method, do let us know how it goes!
