Solved – Do there exist adaptive step size methods for Newton-Raphson optimization

optimization, references, stochastic gradient descent

Stochastic/mini-batch gradient descent, driven by interest in deep learning, has seen many advances in adaptive step sizes. For example, Adam, Nadam, Adamax, …, are all improvements over standard SGD, which uses a fixed step size. (Improvements in the sense of finding smaller minima or converging more stably.)
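
For concreteness, here is a minimal sketch of an Adam-style update (my own illustration; the variable names and hyperparameter defaults are just the usual conventions, not tied to any particular library):

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam-style update (t is the 1-based iteration count): exponential
    moving averages of the gradient and its square rescale the step size
    separately for each coordinate."""
    m = beta1 * m + (1 - beta1) * grad        # first-moment (momentum) estimate
    v = beta2 * v + (1 - beta2) * grad ** 2   # second-moment estimate
    m_hat = m / (1 - beta1 ** t)              # bias corrections for the
    v_hat = v / (1 - beta2 ** t)              # zero-initialized averages
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```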

However, I haven't seen much research on adaptive step sizes for Newton-Raphson algorithms. For example, would a momentum-based algorithm like Adam make sense? From what I've seen, most step sizes in Newton-Raphson are either fixed at some value less than 1, annealed, or chosen by a line search method (the last of which I want to exclude from this question).

What about if we assume the function to minimize is convex? Can we do any better?

Best Answer

In a sense, Newton-Raphson is automatically doing the adaptive step size: it adapts the step in each dimension (which changes the direction) according to the rate of change of the gradient. If the function is quadratic, this is the "optimal" update in that it converges in one step.
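
To make that concrete, here is a small sketch (my own example, not from the answer) showing that a single full Newton-Raphson step on a quadratic lands exactly at the minimizer:

```python
import numpy as np

# Quadratic target f(x) = 0.5 * x^T A x - b^T x with positive-definite A,
# so the exact minimizer is A^{-1} b.
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([1.0, -1.0])

x = np.array([10.0, -7.0])                 # arbitrary starting point
grad = A @ x - b                           # gradient at x
x_new = x - np.linalg.solve(A, grad)       # one full Newton step (Hessian = A)

print(np.allclose(x_new, np.linalg.solve(A, b)))  # True: converged in one step
```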

Of course, the target function is generally not quadratic; otherwise we wouldn't need the iterative Newton-Raphson algorithm at all, since the first step already gives the closed-form solution. However, if the problem is convex, we can show that the algorithm will still converge. If the function is particularly unstable (i.e., the Hessian is very non-stationary), it's often a good idea to use a line-search method to, for example, check whether half-stepping actually decreases the target function more than the full step proposed by vanilla Newton-Raphson.
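
As an illustration of the half-stepping idea, here is a minimal damped Newton step with a simple backtracking check (a sketch under my own naming, not a production line search):

```python
import numpy as np

def damped_newton_step(f, grad, hess, x, max_halvings=10):
    """Propose a full Newton step and keep halving it until the objective
    actually decreases; fall back to the current point if nothing helps."""
    step = np.linalg.solve(hess(x), grad(x))
    t = 1.0
    for _ in range(max_halvings):
        x_new = x - t * step
        if f(x_new) < f(x):     # accept the first step length that improves f
            return x_new
        t *= 0.5                # otherwise try half the step
    return x                    # no improving step length found
```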

In general, I cannot see any good reason for an adaptive step size outside of a line search. Adaptive step sizes for first-order methods are largely motivated by trying to adapt the step size to the Hessian, which Newton-Raphson already does. An interesting case where you might want one is if you are doing something like mini-batch updates, where you approximate the target function using only a small portion of the full data (I mention this because the SGD tag was used). Then the fact that Newton-Raphson takes a fully adapted step is actually bad; you don't want to immediately optimize the approximated function, because you will jump too far and never converge. However, the problems for which mini-batching is useful are generally problems that Newton-Raphson cannot handle; they are often highly multimodal and very high dimensional, so forming and solving for the full Hessian (even from just a subsample of the data) is prohibitive.
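
Here is a toy illustration (my own construction) of that overshooting problem: estimating a mean with full Newton steps on mini-batches just jumps to each batch's mean, so the iterates never settle down.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=5.0, scale=2.0, size=10_000)   # true mean is 5.0

# Per-batch objective: f(theta) = 0.5 * mean((x_i - theta)^2)
theta = 0.0
for _ in range(20):
    batch = rng.choice(data, size=32, replace=False)
    grad = theta - batch.mean()   # gradient of the batch objective
    hess = 1.0                    # Hessian of the batch objective
    theta -= grad / hess          # full Newton step = jump to the batch mean
    # theta keeps bouncing around 5.0 with the sampling noise of a 32-point batch
```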

EDIT:

In the comments, the OP asks about using path history to inform the step size. This is very common in improving first-order methods, e.g., Adam. One thing to note here is that first-order methods are typically trying to capture second-order information about the gradients in order to update the step. This is useful when the Hessian is relatively stable; if the Hessian varied wildly over the path, using path history to estimate the current Hessian may not be very useful! Shortly, I will argue why, in many maximum likelihood problems, we should expect the Hessian to be fairly stable.

So we could extend this concept to second-order methods. Recall that each step of Newton-Raphson updates according to a local quadratic approximation of the target function. As such, the error in a single step is directly related to the third derivative along the path recommended by the Newton-Raphson step. This suggests that learning from the path would improve Newton-Raphson when we believe the third derivative is large relative to the second derivative and somewhat constant. There may well be literature about this out there, but I'm not aware of it.
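
To spell out the role of the third derivative, the one-dimensional Taylor expansion behind this argument is (my notation, not the answer's):

$$
f(x + d) = f(x) + f'(x)\,d + \tfrac{1}{2} f''(x)\,d^2 + \tfrac{1}{6} f'''(\xi)\,d^3,
\qquad
d_{\mathrm{NR}} = -\frac{f'(x)}{f''(x)}.
$$

The Newton-Raphson step $d_{\mathrm{NR}}$ minimizes the quadratic part exactly, so the single-step error is governed by the cubic remainder, i.e., by how large $f'''$ is relative to $f''$ along the step.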

Interestingly, we can show that this condition of large third derivatives relative to the second derivatives will not occur when either the sample size is large relative to the number of parameters estimated, or the model is heavily penalized with an $L_2$ penalty. The $L_2$ case is easy to show: an $L_2$ penalty affects the second derivative but not the third. If the ratio of sample size to parameters estimated is large, we can defer to maximum likelihood theory: the distribution of the MLE asymptotically approaches a normal distribution. This implies that the log-likelihood approaches a quadratic function... which means that asymptotically, vanilla Newton-Raphson should behave really well! It's been my (by no means exhaustive) experience that on many statistical problems, Newton-Raphson only has trouble with smaller datasets, and typically converges in fewer than 10 iterations for moderate to large sample sizes.
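
As a quick one-dimensional check of the $L_2$ claim above (my notation): writing the penalized objective as

$$
g(\theta) = f(\theta) + \frac{\lambda}{2}\,\theta^2,
\qquad
g''(\theta) = f''(\theta) + \lambda,
\qquad
g'''(\theta) = f'''(\theta),
$$

the penalty inflates the second derivative while leaving the third untouched, which shrinks exactly the ratio that drives the single-step error above.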

None of this discredits the possibility of adaptive update learning improving Newton-Raphson. But it does suggest the limited range of applications where it could help: for problems with lots of data per parameter estimated, we should expect Newton-Raphson to perform well, and for extremely high dimensional problems, we can't form and solve for the Hessian anyway. This leaves a relatively small class of statistical problems for which Newton-Raphson could be significantly improved.
