Does the same logic hold true for gradient boosted trees?
Yes, by all means. Gradient boosting can be used to minimize any sensible loss function, and it is very effective at doing so.
It is worth noting that generalised linear models are generally chosen based not on the loss/utility function (which answers the question: how well is my model doing / how bad are its errors?), but on the kind of random variable you want to model. For instance, if your target variable is the number of events of some kind registered over some period of time, it makes sense to use a Poisson model. And if you have a rich, complex dataset, XGBoost can model a Poisson response much better than a GLM.
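For illustration, here is a minimal sketch of fitting a Poisson response with XGBoost's built-in `count:poisson` objective (the data is synthetic and the hyperparameters are arbitrary, not a recommendation):

```python
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))          # toy features, purely illustrative
y = rng.poisson(lam=np.exp(X[:, 0]))    # count target with a log-link structure

# "count:poisson" is XGBoost's built-in Poisson regression objective
model = xgb.XGBRegressor(objective="count:poisson", n_estimators=200)
model.fit(X, y)
print(model.predict(X[:5]))             # predictions are positive expected counts
```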
You can define a custom objective if you wish, but does it matter?
Of course it does, but I'd like to point out that trees are inherently non-linear (there is no constraint on the functional form), so a model trained with MSE loss can often do quite well even when judged by quite different score functions, even on classification tasks (see the sketch below)! However, MSE is symmetric, and when circumstances require weighting one tail more than the other (as in gamma regression, or in binary regression near the extremes), MSE is suboptimal and does not perform as well as the most fitting loss function.
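If you want to check that claim yourself, here is a quick, self-contained sketch (using scikit-learn's gradient boosting; the dataset and seeds are arbitrary) that trains one booster with log-loss and one with plain MSE on the same binary task, then compares them on AUC:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, GradientBoostingRegressor
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# log-loss boosting: the "right" loss for a binary target
clf = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)
auc_logloss = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])

# squared-error boosting directly on the 0/1 labels: the "wrong" loss
reg = GradientBoostingRegressor(loss="squared_error", random_state=0).fit(X_tr, y_tr)
auc_mse = roc_auc_score(y_te, reg.predict(X_te))

print(f"AUC with log-loss: {auc_logloss:.3f}, AUC with MSE: {auc_mse:.3f}")
```

On data like this the two AUCs usually come out close, which is the point: the trees' flexibility compensates for the mismatched loss, at least for ranking-style metrics.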
But what is it?
This depends on your goal. For ordinary regression, MSE is such a popular choice because it models the conditional mean of the target variable, which is often exactly the objective; it benefits from the conceptual link with Gaussian variables and the central limit theorem; it is fast; and it is actually quite robust. This of course doesn't mean you have to use it: it's just a good default, but every problem is different, and very often you don't want to predict the conditional mean. For instance, you may need to predict the order of magnitude of some measure, in which case MSE should be applied to the logarithm of that variable; or you could have a situation where outliers are common and shouldn't affect the predictions more than other residuals, in which case MAE is a better loss. You can't list them all, because there are infinitely many!
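To make the "MSE models the conditional mean" point concrete, recall the textbook facts (stated here unconditionally; they hold conditionally on the features as well):

$$\arg\min_{c}\,\mathbb{E}\!\left[(Y-c)^2\right]=\mathbb{E}[Y],\qquad \arg\min_{c}\,\mathbb{E}\!\left[\,|Y-c|\,\right]=\operatorname{median}(Y).$$

So swapping MSE for MAE is not just a robustness tweak: it changes *what* your model estimates, from the conditional mean to the conditional median.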
With sample size $N=30\times10^6$ and 500 features, you've already tried (most of) the usual regularization tricks, so it doesn't look like there's much left to do at this point.
However, maybe the problem here is upstream. You haven't told us exactly what your dataset is (what are the observations? what are the features?) or what you are trying to classify. You also don't describe your architecture in detail (how many neurons do you have? which activation functions are you using? what rule do you use to convert the output layer's result into a class choice?). I will proceed under the assumptions that:
- you have 512 units in the input layer, 512 units in each of the hidden layers and 2 units in the output layer, corresponding to $p=525312$ parameters. In this case, your dataset seems large enough to learn all the weights.
- you're using One-Hot Encoding to perform classification.
Correct me if my assumptions are wrong. Now:
- if you have structured data (meaning you're not doing image classification), maybe there's just nothing you can do. XGBoost usually beats DNNs on structured-data classification. Have a look at Kaggle competitions: you'll see that for structured data, the winning teams usually use ensembles of extreme gradient boosted trees, not deep neural networks.
- if you have unstructured data, then something's weird: DNNs usually dominate XGBoost here. If you're doing image classification, don't use an MLP: nearly everyone now uses a CNN. Also, make sure you're not using sigmoid activation functions; use something such as ReLU instead.
- You didn't try early stopping and learning rate decay. Early stopping usually "plays nice" with most other regularization methods and is easy to implement, so that's the first thing I'd try if I were you (see the Keras sketch after this list). In case you're not familiar with early stopping, read this nice answer:
Early stopping vs cross validation
- If nothing else helps, you should check for errors in your code. Can you write unit tests? If you're using TensorFlow, Theano or MXNet, can you switch to a high-level API such as Keras or PyTorch? One might expect that using a high-level API, where less customization is possible, would drive your test error up, not down. However, the opposite often happens, because a higher-level API allows you to do the same work with much less code, and thus far fewer opportunities for mistakes. At the very least, you can be sure your high test error isn't due to coding bugs.
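Here is a minimal early-stopping sketch in Keras, matching the 512-512-2 architecture assumed above (the data is random noise and the patience value is arbitrary; tune both to your problem):

```python
import numpy as np
import tensorflow as tf

# made-up data just so the snippet runs end to end
X_train = np.random.normal(size=(1000, 512)).astype("float32")
y_train = np.random.randint(0, 2, size=(1000,))

model = tf.keras.Sequential([
    tf.keras.Input(shape=(512,)),
    tf.keras.layers.Dense(512, activation="relu"),
    tf.keras.layers.Dense(512, activation="relu"),
    tf.keras.layers.Dense(2, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss",         # watch the validation loss...
    patience=10,                # ...stop after 10 epochs without improvement...
    restore_best_weights=True,  # ...and roll back to the best epoch, not the last one
)
model.fit(X_train, y_train, validation_split=0.2, epochs=500, callbacks=[early_stop])
```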
Finally, I didn't add anything about dealing with class imbalance because you seem quite knowledgeable, so I assume you've already used the usual methods for it. In case I'm wrong, let me know and I'll add a couple of tricks, citing questions that deal specifically with class imbalance if needed.
Best Answer
LightGBM does not always use Hessian information; with that out of the way, let's go through your questions:
"what's the best way to train such a model on a loss function that has no second derivatives?" The cleanest way will be to approximate/replace that discontinuous loss function with a loss that has second derivatives. For example, for MAE we can use the Pseudo Huber loss with a small $\alpha$. Such a replacement would take care of any inconsistencies we might expect due to discontinuities and guarantees that our implementation with theoretical derivations. That said, if the Hessian is unavailable setting it to an identity matrix makes our Newton step equal to GD step; for that matter LightGBM does exactly that i.e.
hessians[i] = 1.0f;
in the implementation of RegressionL1loss.
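For illustration, here is a minimal sketch of the Pseudo-Huber trick as a custom objective, using LightGBM's scikit-learn interface (the function name, the toy data, and the value of `alpha` are my own choices, not anything prescribed by the library):

```python
import numpy as np
import lightgbm as lgb

# Pseudo-Huber: a smooth, twice-differentiable stand-in for MAE;
# a small `alpha` makes it hug |residual| more closely
def pseudo_huber(y_true, y_pred, alpha=0.5):
    residual = y_pred - y_true
    scale = 1.0 + (residual / alpha) ** 2
    grad = residual / np.sqrt(scale)       # first derivative of the loss
    hess = 1.0 / (scale * np.sqrt(scale))  # second derivative, strictly positive
    return grad, hess

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = X[:, 0] + rng.standard_t(df=2, size=1000)  # heavy-tailed noise, where MAE-like losses shine

# the sklearn API accepts a callable objective returning (grad, hess)
model = lgb.LGBMRegressor(objective=pseudo_huber, n_estimators=200)
model.fit(X, y)
```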
It is important, though, to understand why a loss does not have second derivatives and how this affects our model's behaviour. It might have minimal impact, or it might signify an important transition. The LightGBM developers obviously investigated this before making such a change (link to the relevant issue: here).

"Why do the existing packages use Newton-Raphson iterations for gradient boosting trees, as opposed to some variant of raw gradient descent?" The answer is that NR iterations usually lead to faster convergence to an optimum, in the sense that the extra per-iteration cost (to get the Hessian) is offset by the reduced iteration count. On that matter, CV.SE has two excellent threads: "https://stats.stackexchange.com/questions/202858" explains in great detail why we care about the NR steps, and "https://stats.stackexchange.com/questions/320082" extends this to why we don't go to even higher-order methods. The same arguments carry over immediately to Deep Learning (DL) applications; in DL they apply when going from first- to second-order derivatives, rather than from second- to third-order as in the case of GBMs. To that extent, note that DL exploded in popularity after automatic differentiation (AD) became mature; without AD, backpropagation was tedious at best - and those are first derivatives, let alone second! (To exemplify this: DeepMind/Google's latest DL framework, JAX, is pretty much AD (in the form of Autograd) plus faster numerical linear algebra (in the form of XLA). Getting good derivatives is hard!)
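To make the Newton-versus-gradient point concrete: in XGBoost-style boosting, the optimal weight of leaf $j$ is computed from the per-observation first and second derivatives $g_i$ and $h_i$ of the loss, with $I_j$ the observations falling in that leaf and $\lambda$ the L2 regularizer:

$$w_j = -\frac{\sum_{i \in I_j} g_i}{\sum_{i \in I_j} h_i + \lambda}.$$

Setting $h_i \equiv 1$, as in the LightGBM snippet above, turns the denominator into $|I_j| + \lambda$, so the leaf weight becomes the (shrunk) average gradient in the leaf: a plain gradient-descent step rather than a Newton step.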
Revisiting now the middle question: "Are there any (mainstream) open source tools that would work with this?" Yes, JAX and PyTorch are the obvious candidates; i.e. if we can't get that second derivative analytically, we will just throw our GPUs/TPUs/NPUs/FPGAs at it until that first derivative wishes it was a second derivative. :D
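As a tiny illustration of how cheap higher-order derivatives are with AD, here is a JAX sketch (the toy loss is my own choice) where the second derivative is literally one extra `jax.grad` call:

```python
import jax
import jax.numpy as jnp

# a toy scalar loss; autodiff gives exact higher-order derivatives on demand
def loss(w):
    return jnp.log1p(jnp.exp(-w))  # softplus of -w, smooth everywhere

dloss = jax.grad(loss)     # first derivative
d2loss = jax.grad(dloss)   # second derivative: just one more grad call

print(dloss(1.0), d2loss(1.0))
```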