[Math] Why is Newton’s method faster than gradient descent

gradient descent, numerical optimization, optimization

Can you provide some intuition as to why Newton's method is faster than gradient descent?

Often we are in a scenario where we want to minimize a function f(x), where x is a vector of parameters. To do that, the main algorithms are gradient descent and Newton's method. For gradient descent we need just the gradient, and for Newton's method we also need the Hessian. Each iteration of Newton's method needs to do a linear solve on the Hessian:

$$x \leftarrow x - \textrm{hess}(f,x) \backslash \textrm{grad}(f,x)$$

where \ indicates doing a linear solve (as in MATLAB). In many cases the Hessian is sparse, and in that case we would use an iterative method like conjugate gradient descent to do the linear solve. So what we're actually doing is this (a short code sketch follows the list):

  1. Compute a quadratic approximation of f around x.
  2. Minimize that quadratic approximation with conjugate gradient descent.
  3. Make a step in the direction of that minimum.
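
For concreteness, here is a minimal sketch of one such Newton step with a conjugate-gradient inner solve; `grad_f` and `hess_f` are hypothetical user-supplied callables, and NumPy/SciPy stand in for MATLAB's backslash:

```python
from scipy.sparse.linalg import cg

def newton_step(x, grad_f, hess_f):
    g = grad_f(x)   # gradient of f at the current iterate
    H = hess_f(x)   # (possibly sparse) Hessian, assumed symmetric positive definite
    # Minimizing the quadratic model 0.5 p^T H p + g^T p amounts to
    # solving H p = -g, done here with the conjugate gradient method.
    p, info = cg(H, -g)
    if info != 0:   # CG did not converge; fall back to a plain gradient step
        p = -g
    return x + p    # step to the minimizer of the quadratic model
```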

Compare this to normal gradient descent, where we just run (conjugate) gradient descent directly on the unapproximated original function f.
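
For contrast, a plain gradient descent step on f itself needs only the gradient and a step size (the fixed step size below is purely illustrative):

```python
def gradient_descent_step(x, grad_f, h=1e-2):
    # one step of plain gradient descent with a fixed, hand-picked step size h
    return x - h * grad_f(x)
```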

My question is: why is it faster to repeatedly do a quadratic approximation, then minimize that approximation with gradient descent, and then do a step in that direction, than to simply run gradient descent on the original function?

Best Answer

The quick answer would be: because Newton's method is a higher-order method, and thus builds a better approximation of your function. But that is not all.


Newton's method typically minimizes the second-order approximation of a function $f$ exactly. That is, it iteratively sets $$ x \gets x - \left[ \nabla^2 f(x) \right]^{-1} \nabla f(x). $$

Gradient descent has access only to the first-order approximation, and makes the update $$ x \gets x - h \nabla f(x), $$ for some step size $h$.
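
To make the contrast concrete (an illustrative example, not part of the original answer): for a strictly convex quadratic $f(x) = \tfrac{1}{2} x^\top A x - b^\top x$ with $A \succ 0$, we have $$ \nabla f(x) = A x - b, \qquad \nabla^2 f(x) = A, $$ so a single Newton step from any starting point lands exactly on the minimizer, $$ x - A^{-1}(A x - b) = A^{-1} b, $$ whereas gradient descent moves only to $x - h(Ax - b)$ and approaches $A^{-1}b$ gradually over many iterations.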


The practical difference is that Newton's method assumes you have much more information available, makes much better updates, and thus converges in fewer iterations.

If you don't have any further information about your function, and you are able to use Newton's method, just use it.

But the number of iterations needed is not all you want to know. A Newton update scales poorly with problem size: if $x \in \mathbb{R}^d$, then computing $\left[ \nabla^2 f(x) \right]^{-1}$ requires $\mathcal{O}(d^3)$ operations, whereas the cost of a gradient descent update is only linear in $d$.
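
For a rough sense of scale (illustrative numbers, not from the answer): with $d = 10^4$, a dense Newton solve costs on the order of $d^3 = 10^{12}$ operations, while a single gradient step costs on the order of $d = 10^4$ operations, a difference of eight orders of magnitude per iteration.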

In many large-scale applications, very often arising in machine learning, $d$ is so large (it can be in the billions) that you are far beyond being able to afford even a single Newton update.