XGBClassifier – Utilizing Sample Weights in Machine Learning

boostingclassificationmachine learningscikit learn

I am using Scikit-Learn XGBClassifier API with sample weights. If I multiply sample weights by 2, I get totally different results with exact same parameters and random_state, I am expecting that If we multiply/divide sample weight with a constant, results should not change. Do you have any suggestion?

Best Answer

What you describe, while somewhat unusual it is not unexpected if we do not optimise our XGBoost routine adequately. Your intuition though is correct: "results should not change".

When we change the scale of the sample weights, the sample weights change the deviance residuals associated with each data point; i.e. the use of different sample weights' scale, results in our GBM to train on a different sample per se. When performing gradient boosting iteration, the residuals that serve as leaf weights are multiplied by that sample weights. Therefore the fit themselves are different especially during the first few iterations of XGBoost. Usually the difference in the fit due to different sample weights' scale is not substantial and will ultimately smooth out but it can noticeable (especially during the first iterations).

# Using Python 3.6.9 // xgboost 0.90
import pandas as pd
import numpy as np
from xgboost import XGBRegressor
import xgboost as xgb

w=np.array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1]) 
w_2=w*2
X=pd.DataFrame([13.36, 5.35, 0.26, 84.16, 24.67, 22.26, 18.02, 14.20, 61.66, 57.26])
y=pd.DataFrame([37.54, 14.54, -0.72, 261.19, 76.90, 67.15, 53.89, 43.48, 182.60, 179.44])
X_test=pd.DataFrame([0.5]) 

xgb_model=XGBRegressor(n_estimators=100, learning_rate=1, 
                       objective='reg:squarederror', subsample=1, reg_lambda=0.1)

xgb_model.fit(X, y, sample_weight=w)
print(xgb_model.predict(X_test, ntree_limit=5)) 
# [-0.65936375]
print(xgb_model.predict(X_test))
# [-0.71998453]

xgb_model.fit(X, y, sample_weight=w_2)
print(xgb_model.predict(X_test, ntree_limit=5))
#[-0.76515234]
print(xgb_model.predict(X_test))
# [-0.7199712]

As we can see, using either the initial "unit weights" w or their scaled version w_2 returns effectively the same estimate (~0.7199...) when we optimise "enough" (e.g. after doing 100 iterations). Nevertheless, when we are starting the first estimates can be substantially different (-0.6593... against -0.7651...). (Notice the observed behaviour is a bit version dependant. I played with XGBoost ver. 1.0.1 and the difference tapers off very quickly at about ntree_limit=4.)

If we observe substantial difference between the estimates of two boosters where the only difference is the scale of the sample weights, this is primarily indicative of two things:

we have not optimised the boosters adequately so they have not yet reach a steady state. We need to optimise further (e.g. have more iterations).
we have over-fitted our samples so the boosters interpreter the differences in the scaling of sample weights as material difference. We need to regularise more strongly (e.g. have higher regularisation parameters reg_alpha and reg_lambda).

Related Solutions

Solved – PCA principal components in sklearn not matching eigen-vectors of covariance calculated by numpy

While this is a pure python related question which is not fitted here for CrossValidated, let me help you anyway. Both procedures find the correct eigenvectors. The difference is in its representation. While PCA() lists the entries of an eigenvectors rowwise, np.linalg.eig() lists the entries of the eigenvectors columnwise. Remember that eigenvectors are only unique up to a sign. Indeed, a simple check yields:

print(abs(eig_vec.T.round(10))==abs(pca.components_.round(10)))
[[ True,  True,  True,  True],
   [ True,  True,  True,  True],
   [ True,  True,  True,  True],
   [ True,  True,  True,  True]])

Solved – Using instance weights in XGBoost

The reason behind using the weights this way was to try to keep the model away from instances with very high/very low probability. My interpretation is that the model is fairly confident about those instances and hence should focus elsewhere.

This is similar to AdaBoost, giving higher weights (during fitting, updating weights between trees) to misclassified data. Gradient boosting is similar in spirit to AdaBoost, but different in approach. GBMs in general push probability scores towards 0 and 1.

The biggest difference then is that you give up on data where the model is confident but wrong, whereas the unweighted model will put even more focus there. ["Give up on" is perhaps not completely fair. The loss function will be high on these points, but their weights will be small. I guess it depends on the exact tradeoff whether the model will care more about them than pushing points away from 0.5 predictions.]

I have run a few experiments (on the adult dataset) and have found that adding weights in this manner does improve the model fit by a few points.

Just to check: you are reporting an improvement on separate test data, right?

I want to understand how generalizable this approach is. And what potential issues I could be facing using the above method. I am not very familiar with XGBoost internals and need help understanding the implications of the above approach.

The first thing to come to mind is that this is prone to overfit, especially if you repeat the process many times. Second, given my comment about giving up on badly misclassified points, I would guess that this method is good at ignoring outliers in otherwise cleanly separated data, but bad at messy overlapping data. At any rate, you can easily detect that by just comparing performance of the unweighted and final weighted models on a validation set.

Anyway, the proof is in the pudding, so do let us know if you continue using the method how it goes!

Best Answer

Related Solutions

Solved – PCA principal components in sklearn not matching eigen-vectors of covariance calculated by numpy

Solved – Using instance weights in XGBoost

Related Question