Solved – What is "instance weight" in boosting?

boosting, machine-learning, weighted-data


Hi, I am reading about boosting, and I had a hard time understanding one of the steps: "assign greater weights to those instances."

What does "assign greater weights to those instances" mean? My understanding is as follows. For example:

Initially, we have training data $(x_1,y_1), (x_2,y_2), (x_3,y_3), (x_4,y_4), (x_5,y_5)$.
After we first apply the weak learner, we find that $(x_2,y_2)$ and $(x_3,y_3)$ are misclassified, so we adjust the training data by "assigning the weights," and the new training data becomes something like $(x_1,y_1), (x_2,y_2), (x_2,y_2), (x_3,y_3), (x_3,y_3), (x_4,y_4), (x_5,y_5)$, where the misclassified instances appear more often. Does that mean the next learner has more chances to learn the misclassified ones?

Best Answer

"Instance" is just a somewhat confusing way of saying "case" or "person" or "observation," etc.

Imagine we have N data points we are trying to predict; each of those data points would be an "instance." If our data look like:

  y x
1 1 4
2 0 2
3 0 3
4 1 3
5 1 3

Then we have 5 "instances" and each row (observation, case, etc.) represents an instance. Imagine we predict y from x using a weak learner. We find that instance #3 (y = 0, x = 3) is classified incorrectly. In the next iteration, we would weight that instance higher than the others.
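To make "weight that instance higher" concrete, here is a minimal sketch of an AdaBoost-style weight update on the toy table above. The weak learner's rule (predict y = 1 when x ≥ 3) is an assumption chosen so that only instance #3 is misclassified, matching the answer's scenario:

```python
import math

# Toy data from the table above: 5 instances (rows).
y = [1, 0, 0, 1, 1]
x = [4, 2, 3, 3, 3]

# Start with uniform weights over the N = 5 instances.
n = len(y)
w = [1.0 / n] * n

# Assumed weak learner: predict y = 1 whenever x >= 3.
pred = [1 if xi >= 3 else 0 for xi in x]
miss = [pi != yi for pi, yi in zip(pred, y)]  # only instance #3 is wrong

# AdaBoost-style update: weighted error, learner coefficient alpha,
# then up-weight the misclassified instances and renormalize.
err = sum(wi for wi, m in zip(w, miss) if m)
alpha = 0.5 * math.log((1 - err) / err)
w = [wi * math.exp(alpha if m else -alpha) for wi, m in zip(w, miss)]
total = sum(w)
w = [wi / total for wi in w]

print(w)  # the misclassified instance now carries more weight
```

With these numbers the misclassified instance #3 ends up with weight 0.5 while the four correctly classified instances each get 0.125, so the next learner pays much more attention to instance #3.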

I wouldn't necessarily say that the learner "has more chances to learn the misclassified ones," as every instance/case/row/observation is included in each iteration. It is just that subsequent learners focus more on misclassified instances.
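That said, the duplication intuition in the question is not far off: training on weighted instances is numerically equivalent to training on a resampled dataset where high-weight instances appear more often. A small sketch, using the weights from the scenario above (0.5 on the misclassified instance, 0.125 on the rest, which are assumed values for illustration):

```python
# Weighted error on the original data vs plain error on a resampled
# dataset where the misclassified instance is duplicated.
y = [1, 0, 0, 1, 1]
pred = [1, 0, 1, 1, 1]                  # instance #3 (index 2) is wrong
w = [0.125, 0.125, 0.5, 0.125, 0.125]   # weights after one boosting round

weighted_err = sum(wi for wi, yi, pi in zip(w, y, pred) if yi != pi)

# Duplicating instance #3 three extra times gives each of the 8 rows
# weight 1/8, so the plain error rate matches the weighted error above.
y_dup = y + [y[2]] * 3
pred_dup = pred + [pred[2]] * 3
plain_err = sum(yi != pi for yi, pi in zip(y_dup, pred_dup)) / len(y_dup)

print(weighted_err, plain_err)  # both 0.5
```

So weighting and duplication express the same idea; weighting just does it without physically copying rows, and every instance stays in the training set at every iteration.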