[Math] sublinear rate of convergence in mathematical optimization

convex-optimization, optimization, rate-of-convergence

According to the Wikipedia page, a sequence $\{x_k\}$ with limit $L$ converges sublinearly if
\begin{align*}
\limsup_{k \to \infty} \frac{\|x_{k+1} - L\|}{\|x_k - L\|} = 1.
\end{align*}
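(As a concrete illustration of this definition: the sequence $x_k = 1/k$ with $L = 0$ satisfies it, since the ratio equals $k/(k+1) \to 1$.)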

In most convex optimization books, a sequence that converges at a rate of $\mathcal O(1/k)$ is also said to converge sublinearly. I am a little confused about the exact meaning of a sublinear rate. Could anyone please clarify this for me?

Also, suppose we have a sequence $\{a_k\}$ such that $a_{k+1} \le q_k a_{k}$ with $q_k < 1$ for every $k$, but $\lim_{k \to \infty} q_k = 1$. By the Wiki definition, this convergence rate should be sublinear. How does this rate compare to $\mathcal O(1/k)$? Is it faster or slower?

Best Answer

You need to start by understanding that the traditional asymptotic analysis of sublinear/linear/superlinear/quadratic convergence of optimization algorithms isn't directly comparable to the non-asymptotic analysis of first-order methods, which is stated in terms of the number of iterations needed to obtain an $\epsilon$-optimal solution (e.g. $O(1/\epsilon)$) or the reduction in the objective gap that can be obtained after $k$ iterations (e.g. $O(1/k)$).

Asymptotic analysis is about what happens in the limit as $x^{(k)}$ approaches $x^{*}$. The other kind of analysis is about what happens after a relatively small number of iterations, well before the algorithm has shown its limiting behavior.
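To make the distinction concrete, here is a small Python sketch that applies the asymptotic ratio test to two toy scalar sequences; the sequences $0.5^k$ and $1/k$ are chosen purely for illustration and are not from any particular algorithm.

```python
import numpy as np

# Illustrative sketch: apply the ratio test |x_{k+1} - L| / |x_k - L| (with L = 0)
# to two toy sequences.
K = np.arange(1, 21)
linear_seq = 0.5 ** K      # converges linearly: ratio stays at 0.5
sublinear_seq = 1.0 / K    # converges sublinearly: ratio k/(k+1) -> 1

for name, s in [("0.5^k", linear_seq), ("1/k", sublinear_seq)]:
    ratios = s[1:] / s[:-1]
    print(f"{name:>6}: last ratios = {np.round(ratios[-3:], 4)}")
```

The second sequence is exactly the $O(1/k)$ behavior discussed below, yet the ratio test classifies it as sublinear, which is why the two vocabularies are so easily conflated.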

Quasi-Newton and Newton methods have asymptotic convergence rates that are at least superlinear (quadratic for Newton's method). In practice, we often run these algorithms until they have converged to the full available accuracy of the floating point (typically double precision) arithmetic in use. As a result, we typically see the limiting asymptotic behavior of these methods. For example, in Newton's method, it is common to see the number of accurate digits in the solution double in each of the last few iterations until the limits of the floating point arithmetic are reached. Any superlinearly or quadratically convergent algorithm that reaches this limiting behavior will produce a fully accurate solution within a few iterations.
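Here is a rough sketch of that digit-doubling behavior, using a one-dimensional Newton iteration for the root of $x^2 - 2$; the function, starting point, and precision setting are made up for this illustration, but the same pattern appears when minimizing a smooth, strongly convex function.

```python
from decimal import Decimal, getcontext

# Illustrative sketch: Newton's iteration for the root of x^2 - 2, run in high-precision
# decimal arithmetic so the quadratic (digit-doubling) phase is visible.
getcontext().prec = 60
x = Decimal("1.5")
target = Decimal(2).sqrt()

for k in range(5):
    x = (x + Decimal(2) / x) / 2          # Newton step for f(x) = x^2 - 2
    err = abs(x - target)
    print(f"iter {k + 1}: error = {err:.3e}")
# The error roughly squares each iteration (about 1e-3, 1e-6, 1e-12, 1e-24, ...),
# so the number of correct digits roughly doubles until the working precision is exhausted.
```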

In comparison, in the brave new world of first-order methods for large-scale convex but non-smooth problems, we can seldom achieve more than 3-4 digits of accuracy in the solution. The algorithms are typically stopped long before full convergence. We're not particularly interested in the asymptotic convergence rate (which is typically sublinear) because we'll never get anywhere near the limiting behavior. Rather, we're interested in how many iterations are required to find a solution that is adequate for our purposes.
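To illustrate this with a made-up problem (not one from the answer): a plain subgradient method on the nonsmooth function $f(x) = |x|$ with the classical $1/\sqrt{k}$ diminishing step size gains accuracy very slowly.

```python
import math

# Hedged sketch: the subgradient method on the nonsmooth function f(x) = |x|
# with diminishing step size 1/sqrt(k). Problem and step rule chosen only for illustration.
x = 5.0
best = abs(x)                     # best objective value seen so far
for k in range(1, 100_001):
    g = 1.0 if x > 0 else (-1.0 if x < 0 else 0.0)   # a subgradient of |x| at x
    x -= g / math.sqrt(k)                            # subgradient step
    best = min(best, abs(x))
    if k in (100, 10_000, 100_000):
        print(f"k = {k:6d}: best f-value so far = {best:.2e}")
# Accuracy is gained only slowly, iteration by iteration, which is why such methods
# are usually stopped long before anything like full floating point accuracy is reached.
```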

In particular, we define an $\epsilon$ approximate solution to be a solution $\hat{x}$, with $f(\hat{x})-f(x^{*}) \le \epsilon$. For a fixed $\epsilon$, we can then analyze how many iterations are required to obtain an $\epsilon$ approximate solution and show that (for example) our algorithm can find an $\epsilon$ approximate solution in $O(1/\epsilon)$ iterations. Alternatively, we might analyze our algorithm to show that after $k$ iterations, the difference $f(x^{(k)})-f(x^{*})$ is (for example) $O(1/k)$. Here the constant hidden by the $O()$ would involve the distance from $x^{(0)}$ to $x^{*}$.
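To spell out the connection between the two statements: if the analysis gives $f(x^{(k)})-f(x^{*}) \le C/k$, where $C$ depends on $\|x^{(0)} - x^{*}\|$, then any $k \ge C/\epsilon$ yields an $\epsilon$ approximate solution, so the $O(1/k)$ bound on the objective gap and the $O(1/\epsilon)$ bound on the iteration count are two ways of stating the same result (and a $C/k^{2}$ bound similarly corresponds to $O(1/\sqrt{\epsilon})$ iterations).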

It's possible for two first order methods to both have sublinear asymptotic convergence in the limit as $x^{(k)}$ approaches $x^{*}$ but have $O(1/k)$ and $O(1/k^{2})$ convergence respectively as measured by $f(x^{(k)})-f(x^{*})$. Clearly, the $O(1/k^{2})$ algorithm is to be preferred, even though the asymptotic convergence is sublinear in both cases.
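A standard example of this pair is plain gradient descent ($O(1/k)$ guarantee on the objective gap for smooth convex problems) versus Nesterov's accelerated gradient method ($O(1/k^{2})$). Below is a rough numerical sketch; the random quadratic test function, step size, and iteration counts are invented for illustration.

```python
import numpy as np

# Hedged sketch: gradient descent vs. Nesterov's accelerated gradient method on a
# made-up smooth convex quadratic f(x) = 0.5 * x^T A x, whose minimum value is 0.
rng = np.random.default_rng(0)
M = rng.standard_normal((50, 50))
A = M.T @ M / 50                              # random positive semidefinite matrix
step = 1.0 / np.linalg.eigvalsh(A).max()      # 1/L, with L the gradient's Lipschitz constant

def gap(x):
    return 0.5 * x @ A @ x                    # f(x) - f(x*), since f(x*) = 0

x_gd = x_acc = y = np.ones(50)
t = 1.0
for k in range(1, 201):
    x_gd = x_gd - step * (A @ x_gd)                    # plain gradient step
    x_new = y - step * (A @ y)                         # accelerated: gradient step at y
    t_new = (1 + np.sqrt(1 + 4 * t * t)) / 2
    y = x_new + ((t - 1) / t_new) * (x_new - x_acc)    # momentum/extrapolation step
    x_acc, t = x_new, t_new
    if k in (50, 100, 200):
        print(f"k = {k:3d}: GD gap = {gap(x_gd):.2e}, accelerated gap = {gap(x_acc):.2e}")
# Over these pre-asymptotic iteration counts the accelerated method drives the objective
# gap down much faster, which is exactly what the non-asymptotic bounds predict.
```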

One final comment is that none of this relates to the polynomial time complexity of interior point methods for LP, SOCP, and SDP. That's yet another way of analyzing the convergence of some optimization algorithms. From the point of view of computational practice in solving very large-scale optimization problems, it's important to realize that polynomial time sometimes isn't good enough: for truly large-scale problems you might not even be able to complete one iteration in the available time. Similarly, a superlinearly convergent algorithm isn't of much use if you can't run it for enough iterations to see the asymptotic superlinear convergence.

Some further reading:

Beck, Amir. First-Order Methods in Optimization. Philadelphia, PA: SIAM, 2017.

Bertsekas, Dimitri P. Convex Optimization Algorithms. Belmont, MA: Athena Scientific, 2015.

Nesterov, Yurii. Lectures on Convex Optimization. 2nd ed. New York, NY: Springer, 2018.
