Does the same logic hold true for gradient boosted trees?
Yes, by all means. Gradient boosting can be used to minimize any sensible loss function, and it is very effective at doing so.
It is worth noting that generalised linear models are generally chosen based not on the loss/utility function (which answers the question: how well is my model doing / how bad are its errors?), but on the kind of random variable you want to model. For instance, if your target variable is the number of events of some kind registered over some period of time, it makes sense to use a Poisson model. And if you have a rich, complex dataset, XGBoost can model a Poisson response much better than a GLM.
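For illustration, here is a minimal sketch of fitting a Poisson response with XGBoost's built-in `count:poisson` objective (the data is synthetic and the hyperparameters are arbitrary, not a recommendation):

```python
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))          # toy features, purely illustrative
y = rng.poisson(lam=np.exp(X[:, 0]))    # count target with a log-link structure

# "count:poisson" is XGBoost's built-in Poisson regression objective
model = xgb.XGBRegressor(objective="count:poisson", n_estimators=200)
model.fit(X, y)
print(model.predict(X[:5]))             # predictions are positive expected counts
```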
You can define a custom objective if you wish, but does it matter?
Of course it does, but I'd like to point out that trees are inherently non-linear (there is no constraint on the functional form), so a model trained with MSE loss can often do quite well even when judged by quite different score functions, even on classification tasks (see the sketch below)! However, MSE is symmetric, and when circumstances require weighting one tail more than the other (as in gamma regression, or in binary regression near the extremes), MSE is suboptimal and does not perform as well as the most fitting loss function.
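If you want to check that claim yourself, here is a quick, self-contained sketch (using scikit-learn's gradient boosting; the dataset and seeds are arbitrary) that trains one booster with log-loss and one with plain MSE on the same binary task, then compares them on AUC:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, GradientBoostingRegressor
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# log-loss boosting: the "right" loss for a binary target
clf = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)
auc_logloss = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])

# squared-error boosting directly on the 0/1 labels: the "wrong" loss
reg = GradientBoostingRegressor(loss="squared_error", random_state=0).fit(X_tr, y_tr)
auc_mse = roc_auc_score(y_te, reg.predict(X_te))

print(f"AUC with log-loss: {auc_logloss:.3f}, AUC with MSE: {auc_mse:.3f}")
```

On data like this the two AUCs usually come out close, which is the point: the trees' flexibility compensates for the mismatched loss, at least for ranking-style metrics.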
But what is it?
This depends on your goal. For ordinary regression, MSE is such a popular choice because it models the conditional mean of the target variable, which is often exactly the objective; it benefits from the conceptual link with Gaussian variables and the central limit theorem; it is fast; and it is actually quite robust. This of course doesn't mean you have to use it: it's just a good default, but every problem is different, and very often you don't want to predict the conditional mean. For instance, you may need to predict the order of magnitude of some measure, in which case MSE should be applied to the logarithm of that variable; or you could have a situation where outliers are common and shouldn't affect the predictions more than other residuals, in which case MAE is a better loss. You can't list them all, because there are infinitely many!
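To make the "MSE models the conditional mean" point concrete, recall the textbook facts (stated here unconditionally; they hold conditionally on the features as well):

$$\arg\min_{c}\,\mathbb{E}\!\left[(Y-c)^2\right]=\mathbb{E}[Y],\qquad \arg\min_{c}\,\mathbb{E}\!\left[\,|Y-c|\,\right]=\operatorname{median}(Y).$$

So swapping MSE for MAE is not just a robustness tweak: it changes *what* your model estimates, from the conditional mean to the conditional median.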
With sample size $N=30\times10^6$ and 500 features, you've already tried (most of) the usual regularization tricks, so it doesn't look like there's much left to do at this point.
However, maybe the problem here is upstream. You haven't told us exactly what your dataset is (what are the observations? what are the features?) or what you are trying to classify. You also don't describe your architecture in detail (how many neurons do you have? which activation functions are you using? what rule do you use to convert the output layer's result into a class choice?). I will proceed under the assumptions that:
- you have 512 units in the input layer, 512 units in each of the hidden layers and 2 units in the output layer, corresponding to $p=525312$ parameters. In this case, your dataset seems large enough to learn all the weights.
- you're using One-Hot Encoding to perform classification.
Correct me if my assumptions are wrong. Now:
- if you have structured data (meaning you're not doing image classification), maybe there's just nothing you can do. XGBoost usually beats DNNs on structured-data classification. Have a look at Kaggle competitions: you'll see that for structured data, the winning teams usually use ensembles of extreme gradient boosted trees, not deep neural networks.
- if you have unstructured data, then something's weird: DNNs usually dominate XGBoost here. If you're doing image classification, don't use an MLP: nearly everyone now uses a CNN. Also, make sure you're not using sigmoid activation functions; use something such as ReLU instead.
- You didn't try early stopping and learning rate decay. Early stopping usually "plays nice" with most other regularization methods and is easy to implement, so that's the first thing I'd try if I were you (see the Keras sketch after this list). In case you're not familiar with early stopping, read this nice answer:
Early stopping vs cross validation
- If nothing else helps, you should check for errors in your code. Can you write unit tests? If you're using TensorFlow, Theano or MXNet, can you switch to a high-level API such as Keras or PyTorch? One might expect that using a high-level API, where less customization is possible, would drive your test error up, not down. However, the opposite often happens, because a higher-level API allows you to do the same work with much less code, and thus far fewer opportunities for mistakes. At the very least, you can be sure your high test error isn't due to coding bugs.
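Here is a minimal early-stopping sketch in Keras, matching the 512-512-2 architecture assumed above (the data is random noise and the patience value is arbitrary; tune both to your problem):

```python
import numpy as np
import tensorflow as tf

# made-up data just so the snippet runs end to end
X_train = np.random.normal(size=(1000, 512)).astype("float32")
y_train = np.random.randint(0, 2, size=(1000,))

model = tf.keras.Sequential([
    tf.keras.Input(shape=(512,)),
    tf.keras.layers.Dense(512, activation="relu"),
    tf.keras.layers.Dense(512, activation="relu"),
    tf.keras.layers.Dense(2, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss",         # watch the validation loss...
    patience=10,                # ...stop after 10 epochs without improvement...
    restore_best_weights=True,  # ...and roll back to the best epoch, not the last one
)
model.fit(X_train, y_train, validation_split=0.2, epochs=500, callbacks=[early_stop])
```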
Finally, I didn't add anything about dealing with class imbalance because you seem quite knowledgeable, so I assume you've already used the usual methods for it. In case I'm wrong, let me know and I'll add a couple of tricks, citing questions that deal specifically with class imbalance if needed.
Best Answer
LightGBM does not always use Hessian information; with that out of the way, let's go through your questions:
"what's the best way to train such a model on a loss function that has no second derivatives?" The cleanest way will be to approximate/replace that discontinuous loss function with a loss that has second derivatives. For example, for MAE we can use the Pseudo Huber loss with a small $\alpha$. Such a replacement would take care of any inconsistencies we might expect due to discontinuities and guarantees that our implementation with theoretical derivations. That said, if the Hessian is unavailable setting it to an identity matrix makes our Newton step equal to GD step; for that matter LightGBM does exactly that i.e.
hessians[i] = 1.0f;
in the implementation of RegressionL1loss.
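For illustration, here is a minimal sketch of the Pseudo-Huber trick as a custom objective, using LightGBM's scikit-learn interface (the function name, the toy data, and the value of `alpha` are my own choices, not anything prescribed by the library):

```python
import numpy as np
import lightgbm as lgb

# Pseudo-Huber: a smooth, twice-differentiable stand-in for MAE;
# a small `alpha` makes it hug |residual| more closely
def pseudo_huber(y_true, y_pred, alpha=0.5):
    residual = y_pred - y_true
    scale = 1.0 + (residual / alpha) ** 2
    grad = residual / np.sqrt(scale)       # first derivative of the loss
    hess = 1.0 / (scale * np.sqrt(scale))  # second derivative, strictly positive
    return grad, hess

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = X[:, 0] + rng.standard_t(df=2, size=1000)  # heavy-tailed noise, where MAE-like losses shine

# the sklearn API accepts a callable objective returning (grad, hess)
model = lgb.LGBMRegressor(objective=pseudo_huber, n_estimators=200)
model.fit(X, y)
```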
It is important, though, to understand why a loss does not have second derivatives and how this affects our model's behaviour. It might have minimal impact, or it might signify an important transition. The LightGBM developers obviously investigated this before making such a change (link to the relevant issue: here).

"Why do the existing packages use Newton-Raphson iterations for gradient boosting trees, as opposed to some variant of raw gradient descent?" The answer is that NR iterations usually lead to faster convergence to an optimum, in the sense that the extra per-iteration cost (to get the Hessian) is offset by the reduced iteration count. On that matter, CV.SE has two excellent threads: "https://stats.stackexchange.com/questions/202858" explains in great detail why we care about the NR steps, and "https://stats.stackexchange.com/questions/320082" extends this to why we don't go to even higher-order methods. The same arguments carry over immediately to Deep Learning (DL) applications; in DL they apply when going from first- to second-order derivatives, rather than from second- to third-order as in the case of GBMs. To that extent, note that DL exploded in popularity after automatic differentiation (AD) became mature; without AD, backpropagation was tedious at best - and those are first derivatives, let alone second! (To exemplify this: DeepMind/Google's latest DL framework, JAX, is pretty much AD (in the form of Autograd) plus faster numerical linear algebra (in the form of XLA). Getting good derivatives is hard!)
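To make the Newton-versus-gradient point concrete: in XGBoost-style boosting, the optimal weight of leaf $j$ is computed from the per-observation first and second derivatives $g_i$ and $h_i$ of the loss, with $I_j$ the observations falling in that leaf and $\lambda$ the L2 regularizer:

$$w_j = -\frac{\sum_{i \in I_j} g_i}{\sum_{i \in I_j} h_i + \lambda}.$$

Setting $h_i \equiv 1$, as in the LightGBM snippet above, turns the denominator into $|I_j| + \lambda$, so the leaf weight becomes the (shrunk) average gradient in the leaf: a plain gradient-descent step rather than a Newton step.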
Revisiting now the middle question: "Are there any (mainstream) open source tools that would work with this?" Yes, JAX and PyTorch are the obvious candidates; i.e. if we can't get that second derivative analytically, we will just throw our GPUs/TPUs/NPUs/FPGAs at it until that first derivative wishes it was a second derivative. :D
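As a tiny illustration of how cheap higher-order derivatives are with AD, here is a JAX sketch (the toy loss is my own choice) where the second derivative is literally one extra `jax.grad` call:

```python
import jax
import jax.numpy as jnp

# a toy scalar loss; autodiff gives exact higher-order derivatives on demand
def loss(w):
    return jnp.log1p(jnp.exp(-w))  # softplus of -w, smooth everywhere

dloss = jax.grad(loss)     # first derivative
d2loss = jax.grad(dloss)   # second derivative: just one more grad call

print(dloss(1.0), d2loss(1.0))
```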