Solved – XGBoost – Can we find a “better” objective function than RMSE for regression


If we think back to linear models for a moment, we have Ordinary Least Squares (OLS) versus Generalized Linear Models (GLMs). Without going too in-depth, it can be said that GLMs "improve" upon OLS by relaxing some of its assumptions, making them more robust to different types of data. The underlying training algorithm is also somewhat different: OLS minimizes the root mean squared error (RMSE) while GLMs minimize deviance. (I realize that RMSE is a special case of deviance.) This allows us to build linear models based on, say, the gamma distribution, the inverse Gaussian, etc.

My question is: does the same logic hold true for gradient boosted trees? Since we're working with tree-based algorithms now, I'd think that they're not subject to the same assumptions/distributional restrictions as linear models. In the XGBoost package, for example, the default objective function for regression is RMSE. You can define a custom objective if you wish, but does it matter? Does it make sense to do so?

In other words, can we possibly improve our predictive power by setting XGBoost to minimize deviance (say, of a gamma distribution) versus RMSE?

Best Answer

does the same logic hold true for gradient boosted trees?

Yes, by all means. Gradient boosting can be used to minimize any sensible loss function, and it is very effective at doing so.

It is worth saying that generalised linear models are generally chosen not for the loss/utility function (which answers the question: how well is my model doing, how bad are its errors?), but for the kind of random variable you want to model. For instance, if your target variable counts the number of events registered in some time window, it makes sense to use a Poisson model. And if you have a rich, complex dataset, XGBoost can model a Poisson response much better than a GLM.
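To make that concrete, here is a minimal sketch, assuming the Python xgboost package and synthetic count data invented purely for this illustration, of fitting the same boosted trees with the built-in squared-error objective versus the built-in Poisson objective:

```python
import numpy as np
import xgboost as xgb

# Synthetic count data (invented for this example): a Poisson target whose
# rate depends on a couple of features.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
rate = np.exp(0.5 * X[:, 0] - 0.3 * X[:, 1])
y = rng.poisson(rate)

dtrain = xgb.DMatrix(X, label=y)

# Same trees, two different objectives: plain squared error vs. Poisson deviance.
params_mse = {"objective": "reg:squarederror", "max_depth": 3, "eta": 0.1}
params_pois = {"objective": "count:poisson", "max_depth": 3, "eta": 0.1}

model_mse = xgb.train(params_mse, dtrain, num_boost_round=200)
model_pois = xgb.train(params_pois, dtrain, num_boost_round=200)
```

On data like this, the Poisson objective is the one that matches how the target was actually generated; whether it helps on your own data is an empirical question.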

You can define a custom objective if you wish, but does it matter?

Of course it does, but I'd like to point out that trees are perfectly non-linear (there is no constraint on the functional form), so a model trained with the MSE loss can often do quite well even when judged with quite different score functions, even on classification tasks! However, MSE is symmetric, and when the circumstances require weighting one tail more than the other (as in gamma regression, or in binary regression near the extremes), MSE is not optimal and does not perform as well as a more fitting loss function. A sketch of how such a custom loss can be plugged in follows below.
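Here is a hedged sketch of the custom-objective mechanism in the Python xgboost package: the asymmetric squared-error loss and the weight ASYM below are hypothetical choices made up for illustration, not a recommendation. The custom objective returns the gradient and Hessian of the loss with respect to the raw predictions and is passed through the obj argument of xgb.train.

```python
import numpy as np
import xgboost as xgb

# Hypothetical asymmetric squared error: under-predictions are penalised
# ASYM times more heavily than over-predictions.
ASYM = 3.0

def asymmetric_squared_error(preds, dtrain):
    y = dtrain.get_label()
    residual = preds - y
    weight = np.where(residual < 0.0, ASYM, 1.0)  # heavier weight when we under-predict
    grad = 2.0 * weight * residual                # first derivative of the loss w.r.t. preds
    hess = 2.0 * weight                           # second derivative of the loss w.r.t. preds
    return grad, hess

# Synthetic data, invented for this example.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = X[:, 0] + rng.normal(scale=0.1, size=500)
dtrain = xgb.DMatrix(X, label=y)

# The custom objective is supplied via the `obj` argument.
model = xgb.train({"max_depth": 3, "eta": 0.1}, dtrain,
                  num_boost_round=100, obj=asymmetric_squared_error)
```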

But what is the most fitting loss?

This depends on your goal. For ordinary regression MSE is such an appreciated choice because a model that minimizes it estimates the conditional mean of the target variable, which is often exactly the objective; it also benefits from its conceptual link with the Gaussian distribution and the central limit theorem, it is fast to optimize, and it is actually quite robust. This of course doesn't mean you have to use it; it is just a good default, and every problem is different, so very often you don't want to predict the conditional mean. For instance, you could need to predict the order of magnitude of some measure, in which case MSE should be applied to the logarithm of that variable (a sketch of this follows below). Or you could have a situation where outliers are common and shouldn't affect the predictions more than other residuals, in which case MAE is a better loss. You can't list them all, because there are infinitely many!
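As a small sketch of the log-transform idea, again assuming the Python xgboost package and synthetic data invented for this example: when the goal is the order of magnitude of a positive, heavy-tailed target, plain squared error can be applied to log(y) rather than to y itself.

```python
import numpy as np
import xgboost as xgb

# Roughly log-normal target (synthetic, for illustration only).
rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 5))
y = np.exp(1.0 + X[:, 0] + rng.normal(scale=0.5, size=1000))

# Fit ordinary MSE on log(y) instead of y.
dtrain = xgb.DMatrix(X, label=np.log(y))
model = xgb.train({"objective": "reg:squarederror", "max_depth": 3, "eta": 0.1},
                  dtrain, num_boost_round=200)

# Back-transform predictions to the original scale.
preds = np.exp(model.predict(dtrain))
```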