Solved – Why is Newton’s method not widely used in machine learning

gradient descent, hessian, machine learning, optimization

This is something that has been bugging me for a while, and I couldn't find any satisfactory answers online, so here goes:

After reviewing a set of lectures on convex optimization, Newton's method seems far superior to gradient descent for finding globally optimal solutions: Newton's method can provide a guarantee for its solution, it's affine invariant, and most of all it converges in far fewer steps. Why are second-order optimization algorithms, such as Newton's method, not as widely used as stochastic gradient descent in machine learning problems?

Best Answer

Gradient descent minimizes a function using knowledge of its first derivative (the gradient). Newton's method, originally a root-finding algorithm, optimizes a function by also using its second derivative: it searches for a stationary point by applying root finding to the gradient, which requires the Hessian. That can be faster when the second derivative is known and easy to compute (the Newton-Raphson algorithm is used in logistic regression). However, the analytic expression for the second derivative is often complicated or intractable, requiring a lot of computation. Numerical methods for computing it also require a lot of computation: for an $N$-dimensional parameter vector, the gradient has $N$ entries while the Hessian has $N^2$.
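A minimal sketch of the contrast, on a toy objective chosen purely for illustration (the function, step size, and dimension are assumptions, not from the answer above). The gradient is a length-$N$ vector, while the Hessian is an $N \times N$ matrix that must be formed and solved against at every Newton step.

```python
import numpy as np

def f(x):
    # Toy smooth convex objective (illustrative assumption)
    return np.sum(x**4) + np.sum((x - 1)**2)

def grad(x):
    # First derivative: N values
    return 4 * x**3 + 2 * (x - 1)

def hessian(x):
    # Second derivative: N^2 values (diagonal here, stored dense to show the cost)
    return np.diag(12 * x**2 + 2)

def gradient_descent_step(x, lr=0.01):
    # First-order update: x <- x - lr * grad(x); O(N) work per step
    return x - lr * grad(x)

def newton_step(x):
    # Second-order update: x <- x - H(x)^{-1} grad(x);
    # forming H is O(N^2) storage, solving the linear system is O(N^3) time
    return x - np.linalg.solve(hessian(x), grad(x))

x_gd = x_nt = np.full(5, 3.0)
for _ in range(20):
    x_gd = gradient_descent_step(x_gd)
    x_nt = newton_step(x_nt)

print("gradient descent:", f(x_gd))
print("Newton's method :", f(x_nt))
```

On a problem this small the Newton step's cost is negligible and it converges in a handful of iterations; the point of the answer is that with millions of parameters the $N^2$ Hessian entries (and the linear solve) become prohibitive, while the $N$-entry gradient stays affordable.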
