Does it make sense to talk about third-order (or higher-order) optimization methods?

convex-optimization, numerical-optimization, optimization, soft-question

In optimization, most methods I've seen are some variant of gradient descent (popular in machine learning) or methods that use second-derivative information, i.e., the Hessian matrix, with the justification that near a local minimum the function looks roughly quadratic, so a quadratic model is a good approximation.

Maybe this is a naive question, but what about approximations beyond second order? Does it make sense to model the function as a higher-order polynomial and use that information to build better optimization algorithms? Does it not make sense at all, or is it just that, because we already have a lot of machinery for solving linear systems, second-order approximations are easier to take advantage of?

Best Answer

Sure. There are a number of methods, but the most common is called Halley's method. I prefer the paper "On large-scale unconstrained optimization problems and higher order methods" by Gundersen and Steihaug.

That said, there are some good reasons to do this and some good reasons not to. The snarky answer as to why not is that Newton's method converges quadratically in a neighborhood of the solution. A third-order method can converge cubically, but who cares if we can just do another quadratic iteration? That answer disregards that a second-order method does more than just converge quadratically. It takes care of affine scalings even away from the solution, and that's valuable. A third-order method can represent even more complicated scalings, and that may be valuable.

Really, the reason not to use one is pragmatic and comes down to linear system solvers. Even though we call Newton's method a second-order method, it's really a first-order Taylor trick applied to the gradient. What we're doing is writing the first-order optimality conditions as a nonlinear system $F(x)=\nabla f(x) = 0$. Then, Newton's method is simply the first-order Taylor series solution to this problem:

$F^\prime(x)\delta x + F(x) = 0$

becomes

$\nabla^2 f(x)\delta x + \nabla f(x) = 0$
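To make the notation concrete, here is a minimal sketch of that Newton step in JAX; the Rosenbrock objective and the function names are my own choices for illustration, not something from the question or the paper.

```python
import jax
import jax.numpy as jnp

def f(x):
    # Rosenbrock function, a standard test objective (my choice, purely illustrative)
    return (1.0 - x[0])**2 + 100.0 * (x[1] - x[0]**2)**2

def newton_step(x):
    g = jax.grad(f)(x)               # F(x)  = grad f(x)
    H = jax.hessian(f)(x)            # F'(x) = nabla^2 f(x)
    return jnp.linalg.solve(H, -g)   # solve  nabla^2 f(x) dx = -grad f(x)

x = jnp.array([-1.2, 1.0])
print(x + newton_step(x))            # one Newton update
```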

Alright, so we could add an additional term to the Taylor series. Recall,

$$ F(x+\delta x) \approx F(x) + F^\prime(x)\delta x + \frac{1}{2}F^{\prime\prime}(x)(\delta x)(\delta x) $$

Here's where we run into trouble. $F^\prime(x)=\nabla^2 f(x)$ defines a linear system, and we know how to solve linear systems on a computer. We can factorize it. We can apply a Krylov method to it. We can do all sorts of things to it. That said, $F^{\prime\prime}(x)\in L(X,L(X))$: technically it's a linear operator, but applying it to a direction returns another linear operator. As such, we don't have a straightforward way to set the Taylor series above to zero and solve. What Halley's method does is cheat. It first solves for the Newton step $\delta x_{\textrm{Newton}}=-F^{\prime}(x)^{-1}F(x)$ and then plugs this into one of the directions of the last derivative. Hence, we have that

$$ F(x+\delta x) \approx F(x) + F^\prime(x)\delta x + \frac{1}{2}F^{\prime\prime}(x)(\delta x_{\textrm{Newton}})(\delta x) $$

Then, we can solve

$$ (F^\prime(x) + \frac{1}{2}F^{\prime\prime}(x)\delta x_{\textrm{Newton}})\delta x = -F(x) $$

or

$$ (\nabla^2 f(x) + \frac{1}{2}\nabla^3 f(x)\delta x_{\textrm{Newton}})\delta x = -\nabla f(x) $$
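As a rough sketch of what that system looks like in code, here is one Halley step in JAX under the same illustrative assumptions as before (hypothetical objective and helper names; only the formula above comes from the answer). The third derivative is never formed as a full tensor: its contraction with the Newton direction is obtained as a JVP of the Hessian.

```python
import jax
import jax.numpy as jnp

def f(x):
    # Same hypothetical Rosenbrock objective as in the earlier sketch
    return (1.0 - x[0])**2 + 100.0 * (x[1] - x[0]**2)**2

def halley_step(x):
    g = jax.grad(f)(x)                     # grad f(x)
    H = jax.hessian(f)(x)                  # nabla^2 f(x)
    dx_newton = jnp.linalg.solve(H, -g)    # Newton step
    # Directional derivative of the Hessian along the Newton step,
    # i.e. nabla^3 f(x) applied to dx_newton (a matrix), via a JVP.
    T_dxn = jax.jvp(jax.hessian(f), (x,), (dx_newton,))[1]
    # Solve  (nabla^2 f + 1/2 nabla^3 f [dx_newton]) dx = -grad f
    return jnp.linalg.solve(H + 0.5 * T_dxn, -g)

x = jnp.array([-1.2, 1.0])
print(x + halley_step(x))                  # one Halley update
```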

Alright, a pain, but somewhat workable. Of course, now we have other problems. Someone has to sit down and derive the third derivative, which can be prohibitively difficult. This can be alleviated by using automatic differentiation (AD). However, if we want to use AD, we have to be smart about it. A forward-over-reverse scheme is a good way to get Hessian-vector products efficiently, and most AD tools are set up to provide this information easily. They're not entirely set up to provide the third-derivative information in the form Halley's method needs, though they can absolutely do it (see the sketch below). Finally, none of these methods are guaranteed to converge away from the optimal solution, so they need to be globalized with a trust region or a line search. Given that we're now computing two directions at each step, the Newton step and the Halley step, there's some question as to whether we need to globalize both or just one. I know there's an answer to this, but I don't know it off the top of my head. Mostly, it makes implementing the method a pain.
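Here is what that forward-over-reverse machinery looks like in JAX, as a sketch; the helper names `hvp` and `third_vv` are my own, and the objective is the same hypothetical one as above.

```python
import jax
import jax.numpy as jnp

def f(x):
    # Same hypothetical objective as in the earlier sketches
    return (1.0 - x[0])**2 + 100.0 * (x[1] - x[0]**2)**2

def hvp(x, v):
    # Forward-over-reverse Hessian-vector product: JVP of the reverse-mode gradient
    return jax.jvp(jax.grad(f), (x,), (v,))[1]

def third_vv(x, u, v):
    # (nabla^3 f(x)) u v: differentiate the Hessian-vector product once more
    return jax.jvp(lambda y: hvp(y, v), (x,), (u,))[1]

x = jnp.array([-1.2, 1.0])
v = jnp.array([1.0, 0.0])
print(hvp(x, v))
print(third_vv(x, v, v))
```

Inside a Krylov solver, `third_vv(x, dx_newton, v)` supplies the action of the $\nabla^3 f(x)\,\delta x_{\textrm{Newton}}$ term on a vector $v$ without ever forming it, which is the matrix-free route for the Halley system above.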

Anyway, in summary, the difficulties with higher-order methods are:

  1. Pain to deal with the linear systems
  2. Pain to get higher-order derivatives
  3. Pain to deal with globalization

The benefit, in my mind, is the ability to represent more complicated behavior of the function than a quadratic can, especially away from the optimal solution. Really, I do think there's benefit in these methods, but I don't know of anyone who has had the patience to really implement them.
