[Math] What does first and second order convergence mean [with minimal equations] for optimization methods

gradient descent, newton raphson, optimization, rate-of-convergence

It can be shown that Newton's method has second order convergence provided certain criteria are satisfied, and gradient descent has first order convergence, but what does order of convergence mean here?

In numerical methods like finite difference schemes, order of accuracy, e.g., 2nd order accuracy, $O(\Delta x^2)$, means that if you halve your step size, your error decreases by a factor of 4. Is there an analog for order of convergence?
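For instance, a quick Python check of that claim (a central difference for $\sin'(1)$, purely illustrative):

```python
import math

# Central differences are 2nd order accurate, O(h^2):
# halving the step size h should cut the error by roughly a factor of 4.
def central_diff(f, x, h):
    return (f(x + h) - f(x - h)) / (2 * h)

x0, exact = 1.0, math.cos(1.0)   # d/dx sin(x) = cos(x)
for h in (0.1, 0.05, 0.025):
    err = abs(central_diff(math.sin, x0, h) - exact)
    print(f"h = {h:<6} error = {err:.3e}")
# Each halving of h reduces the error by about a factor of 4.
```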

I am a bit confused by the different representations I have seen for order of convergence. I think my general understanding is that a second order convergent method will be such that each subsequent iteration results in a factor of 2 reduction in the error between the optimum and the current location. So I believe this also means that if it takes $N$ iterations for a first order method to converge to a certain value, then it will take $\log(N)$ iterations for a second order method to converge to the same value? Are these correct interpretations?

I'm hoping for intuition with minimal math.

Best Answer

In short, an iterative method that computes a sequence $\mathbf{x}_0, \mathbf{x}_1, \dotsc, \mathbf{x}_n, \dotsc$ converging to $\mathbf{x}^*$ (possibly a vector) is said to be of order $p$ if the errors $\mathbf{e}_n = \mathbf{x}_n - \mathbf{x}^*$ satisfy:

$\begin{align*} \lvert \mathbf{e}_{n + 1} \rvert &= c \lvert \mathbf{e}_n \rvert^p + o(\lvert \mathbf{e}_n \rvert^p) \end{align*}$

That is, the error of the next estimate is approximately $c \lvert \mathbf{e}_n \rvert^p$, and larger $p$ means faster convergence. If $p = 1$ and $c = 0.1$, say, the error shrinks to a tenth of its previous value each iteration (roughly one extra correct digit per iteration); if $p = 2$, the error is roughly squared (the number of correct digits roughly doubles each round). But you have to balance that against the cost per iteration of the method. E.g. in one dimension the secant method has order approximately $1.6$, while Newton's method has order $2$ (quadratic). But the secant method evaluates only the function each round, while Newton evaluates both the function and its derivative. Assuming the derivative costs about as much to compute as the function itself, two secant steps cost roughly one Newton step, and since $1.6^2 \approx 2.6 > 2$, the secant method is faster (in time, not in iterations).
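To make this concrete, here is a minimal Python sketch (my own illustration; the test problem $f(x) = x^3 - 2 = 0$ is arbitrary) that runs Newton and the secant method and estimates the observed order $p$ from consecutive errors:

```python
import math

# Estimate the order p empirically via p_n ≈ log(e_{n+1}/e_n) / log(e_n/e_{n-1})
# for Newton vs. the secant method, both solving f(x) = x^3 - 2 = 0.

f = lambda x: x**3 - 2
df = lambda x: 3 * x**2
root = 2 ** (1 / 3)

def newton_errors(x, steps=6):
    errs = []
    for _ in range(steps):
        errs.append(abs(x - root))
        x -= f(x) / df(x)                    # needs f and its derivative
    return errs

def secant_errors(x0, x1, steps=7):
    errs = []
    for _ in range(steps):
        errs.append(abs(x1 - root))
        x0, x1 = x1, x1 - f(x1) * (x1 - x0) / (f(x1) - f(x0))  # needs f only
    return errs

def observed_orders(errs):
    # Successive estimates of p from three consecutive errors.
    return [math.log(errs[n + 1] / errs[n]) / math.log(errs[n] / errs[n - 1])
            for n in range(1, len(errs) - 1) if errs[n + 1] > 0]

print("Newton:", [round(p, 2) for p in observed_orders(newton_errors(2.0))])
print("secant:", [round(p, 2) for p in observed_orders(secant_errors(2.0, 1.5))])
# Newton's estimates approach 2; the secant estimates hover around 1.6
# (its order is the golden ratio, about 1.618).
```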

You might want to take a look at Brent's Algorithms for Minimization without Derivatives (Dover reissue, 2013).
