Solved – how does XGBoost do regression using trees

boosting, cart, gradient, regression

Usually my job involves classification, but recently I got a project that requires regression. That is to say, my response variable is not a binary True/False but a continuous number; it is roughly Gaussian and centered at zero.

I usually use XGBoost for my classification tasks. To do regression, I found the following in xgboost's manual:


General parameter

this is the only difference with classification, use reg:linear to do linear regression

when labels are in [0,1] we can also use reg:logistic

objective = reg:linear
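For reference, in the Python API this objective goes into the parameter dictionary passed to `xgb.train`. A minimal sketch (the other parameter values are illustrative; note that newer xgboost releases renamed `reg:linear` to `reg:squarederror`):

```python
# Illustrative parameter dictionary for xgboost's Python API.
# "reg:linear" is the squared-error regression objective
# (renamed "reg:squarederror" in newer xgboost releases).
params = {
    "objective": "reg:linear",
    "max_depth": 3,   # each boosting round still fits a *tree* of this depth
    "eta": 0.1,       # learning rate / shrinkage
}
# usage (sketch): booster = xgb.train(params, xgb.DMatrix(X, label=y), num_boost_round=100)
```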


I guess it means: regression, using a linear model

However, when I looked at the output, I found that it is still a set of trees. As we know, trees are not linear. So what does reg:linear really mean? From my perspective, there isn't any "linear regression" happening here.
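To make the point concrete, here is a toy regression tree written out by hand (splits and leaf values are made up for illustration; this is not xgboost output). Its prediction is a step function of x, i.e. piecewise constant, hence nonlinear, whatever objective was used to grow it:

```python
# A depth-2 regression tree by hand (illustrative splits and leaf values).
# The prediction jumps at each split instead of following a straight line,
# so the fitted function is nonlinear in x.
def tree_predict(x):
    if x < 2.0:
        return 0.1 if x < 1.0 else 0.4
    return 0.7 if x < 3.0 else 1.0
```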

Can anyone provide any insight here?

Thanks

Best Answer

A regression tree makes sense: it 'classifies' your data into one of a finite number of predicted values. Note that, although it is used for regression, a regression tree is a nonlinear model.

Once you accept that, the idea of using a random forest instead of a single tree makes sense: one just averages the values of all the regression trees. And once you have a regression forest, the jump to regression via boosting is another small logical step: you run a sequence of weak-learner regression trees, fitting each new one to the remaining error of the ensemble so far. So boosting can give a 'regression', but it is a very nonlinear model! The "linear" in reg:linear refers to the squared-error loss of linear regression, not to the shape of the fitted model.

The details are explained in more depth and clarity here: https://homes.cs.washington.edu/~tqchen/pdf/BoostedTree.pdf. I believe the author is one of the authors of the xgboost package.
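The loop described above can be sketched in a few lines of plain Python. This is an illustrative toy, not xgboost's actual implementation (which also uses second-order gradients and regularization): each round fits a depth-1 tree ("stump") to the current residuals and adds a shrunken copy to the ensemble.

```python
def fit_stump(xs, ys):
    """Depth-1 regression tree: pick the split that minimizes squared error."""
    best = None
    for split in xs:
        left = [y for x, y in zip(xs, ys) if x < split]
        right = [y for x, y in zip(xs, ys) if x >= split]
        if not left or not right:
            continue
        lmean, rmean = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((y - lmean) ** 2 for y in left)
               + sum((y - rmean) ** 2 for y in right))
        if best is None or sse < best[0]:
            best = (sse, split, lmean, rmean)
    _, split, lmean, rmean = best
    return lambda x: lmean if x < split else rmean

def boost(xs, ys, n_rounds=50, lr=0.1):
    """Gradient boosting for squared error: each stump fits the residuals."""
    base = sum(ys) / len(ys)            # start from the mean prediction
    preds = [base] * len(xs)
    stumps = []
    for _ in range(n_rounds):
        # residuals = negative gradient of the squared-error loss
        resid = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, resid)
        stumps.append(stump)
        preds = [p + lr * stump(x) for p, x in zip(preds, xs)]
    return lambda x: base + lr * sum(s(x) for s in stumps)

model = boost([0.0, 1.0, 2.0, 3.0], [0.0, 0.0, 1.0, 1.0])
```

The fitted model is a sum of step functions, so its predictions are piecewise constant: a 'regression', but nothing like a straight line.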