[Math] Intuition Behind Accelerated First Order Methods

algorithms, convex optimization, gradient descent, intuition, optimization

$\newcommand{\prox}{\operatorname{prox}}$
$\newcommand{\argmin}{\operatorname{argmin}}$

Suppose that we want to solve the following convex optimization problem:

$\min_{x \in \mathbb{R}^n} g(x) + h(x)$

where we assume that $g(x)$ is convex and differentiable and $h(x)$ is convex (here I am trying to be as non-specific as possible). Recall that generalized gradient descent can be formulated as follows:

Step $0$: choose an initial $x^{(0)} \in \mathbb{R}^n$

Step $k$ ($k \ge 1$): $x^{(k)} = \prox_h \big(x^{(k-1)} - t_k \nabla g(x^{(k-1)}),\, t_k\big)$

where $\prox$ is the proximal operator, defined as $\prox_h(y,t) := \argmin \limits_{x \in \mathbb{R}^n} h(x) + \frac{1}{2t} \|y-x\|^2$
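To make the recursion concrete, here is a minimal sketch in Python (my own illustration, not from any of the references) for the lasso instance $g(x) = \frac{1}{2}\|Ax - b\|^2$, $h(x) = \lambda \|x\|_1$, whose prox is the soft-thresholding operator; with this choice of $h$ the scheme is exactly ISTA:

```python
import numpy as np

def soft_threshold(y, tau):
    """Prox of tau * ||.||_1, i.e. the soft-thresholding operator."""
    return np.sign(y) * np.maximum(np.abs(y) - tau, 0.0)

def proximal_gradient(A, b, lam, t, num_iters=500):
    """Minimize g(x) + h(x) with g = 0.5*||Ax - b||^2, h = lam*||x||_1.

    Each step is x_k = prox_h(x_{k-1} - t * grad g(x_{k-1}), t); for this
    h the prox is soft-thresholding, so the method is ISTA.
    """
    x = np.zeros(A.shape[1])                       # step 0: initial point
    for _ in range(num_iters):
        grad = A.T @ (A @ x - b)                   # gradient of g at x
        x = soft_threshold(x - t * grad, t * lam)  # prox (gradient) step
    return x
```

A fixed step size $t = 1/L$ with $L = \|A^\top A\|_2$ (the Lipschitz constant of $\nabla g$) guarantees the $O(1/k)$ rate quoted below.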

It is known that if $\nabla g(x)$ is Lipschitz continuous and the proximal operator can be evaluated, the convergence rate will be $O(1/k)$. This result can be accelerated to achieve $O(1/k^2)$.

First proposed by Nesterov in 1983 for smooth functions, the idea of acceleration remains an active topic of research (for non-smooth functions, composite functions, etc.). Nesterov's papers are not easy to read (they are very mathematical), but to get an understanding of the concept it is sufficient to look at ISTA (Iterative Shrinkage-Thresholding Algorithm) and FISTA (Fast Iterative Shrinkage-Thresholding Algorithm). In particular, my questions below are based on FISTA:

Roughly speaking, acceleration is achieved by introducing one more sequence $y_k$, constructed as a specific linear combination of $x_k$ and $x_{k-1}$; the proximal step then operates on $y_k$ instead of $x_k$, i.e., $x_k = \prox_h(y_k - s \nabla g(y_k), s)$ for a step size $s$. In the case of FISTA we have:

$t_{k+1} = \frac{1 + \sqrt{1 + 4t^2_k}}{2}$

$y_{k+1} = x_k + \frac{t_k - 1}{t_{k+1}}(x_k - x_{k-1})$

Note that the sequence $t_k$ satisfies $t^2_k = t^2_{k+1} - t_{k+1}$; this is justified in the proof of convergence for this algorithm.
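Continuing the lasso sketch above (again a hand-rolled illustration, reusing `numpy` and `soft_threshold` from the previous block), FISTA costs the same per iteration as ISTA; only the extrapolation through $y_k$ is new:

```python
def fista(A, b, lam, s, num_iters=500):
    """FISTA for 0.5*||Ax - b||^2 + lam*||x||_1 with step size s."""
    x_prev = np.zeros(A.shape[1])
    y = x_prev.copy()
    t = 1.0                                        # t_1 = 1
    for _ in range(num_iters):
        grad = A.T @ (A @ y - b)                   # gradient of g at y, not x
        x = soft_threshold(y - s * grad, s * lam)  # prox step applied to y
        t_next = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
        y = x + ((t - 1.0) / t_next) * (x - x_prev)  # extrapolation
        x_prev, t = x, t_next
    return x_prev
```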

Is there an intuitive way to explain or interpret such an approach? Why does this specific combination work and bring such a perceptible improvement to the convergence rate? Can we find an intuitive way to interpret $t_k$? Perhaps somebody is familiar with Nesterov's works and has more knowledge than I do about why $t_k$ is given in this form in the first place?

Best Answer

I asked this question myself while trying to understand accelerated methods, asked around, and was pointed to this paper by a professor to help gain some intuition: http://statweb.stanford.edu/~candes/papers/NIPS2014.pdf

To summarize: Su et al. take Nesterov's accelerated gradient method, let the step size become infinitesimally small, and derive the following ODE $$\ddot{X} + \frac{3}{t} \dot{X} + \nabla f(X) = 0$$ with initial conditions $X(0) = x_0$ and $\dot{X}(0) = 0$. By analyzing this ODE, we can get a better idea of what Nesterov's accelerated gradient method is doing. Hope this resource is helpful!
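One rough way to see the correspondence (my own sketch with illustrative names, not code from the paper): since $t_k \approx k/2$, the momentum coefficient $(t_k - 1)/t_{k+1}$ behaves like $1 - 3/k$, which is where the $\frac{3}{t}$ friction term comes from, and the ODE itself can be integrated numerically to watch the trajectory:

```python
import numpy as np

def integrate_ode(grad_f, x0, dt=1e-3, T=20.0):
    """Crude explicit integration of X'' + (3/t) X' + grad f(X) = 0
    with X(0) = x0, X'(0) = 0, starting at t = dt to avoid the
    division by zero in the 3/t friction term.
    """
    x = np.asarray(x0, dtype=float)
    v = np.zeros_like(x)                # initial velocity X'(0) = 0
    t = dt
    while t < T:
        a = -(3.0 / t) * v - grad_f(x)  # acceleration prescribed by the ODE
        v = v + dt * a
        x = x + dt * v
        t += dt
    return x

# Example: f(x) = 0.5 * x^T Q x, whose gradient is Q x
Q = np.diag([1.0, 10.0])
print(integrate_ode(lambda x: Q @ x, x0=[3.0, -2.0]))  # should approach the minimizer 0
```

The slowly vanishing $3/t$ damping is also what produces the characteristic overshooting and oscillation of accelerated methods near the optimum.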
