Nesterov’s Momentum Method and Armijo’s Rule

convex optimizationgradient descentnumerical methodsoptimization

Nesterov's momentum method can be defined through gradient descent and momentum stages as follows:
\begin{align}
v_{k+1} &= \beta v_k – \alpha\nabla f(x_k + \beta v_k) \\
x_{k+1} &= x_k + v_{k+1}
\end{align}

with $\beta$ defined as a tuning parameter and, in this case, $\alpha_k = s\gamma^m$ such that $s > 0$, $\gamma \in [0,1)$, and $\sigma >0$ with $\alpha_k$ obtained via Armijo's rule:

$$f(x_k) – f(x_k + s\gamma^m d_k) \geq -\sigma\gamma^ms\nabla f(x_k)^Td_k$$

Usually, $d_k = -D_k\nabla f(x_k)$ with $D_k$ being some positive definite matrix. ie: in steepest descent $D_k = I$ or in Newton's method $D_k = [\nabla^2f(x_k)]^{-1}$

Question: What is $d_k$ for Nesterov's momentum method in order to utilize Armijo's rule? Is it the same as steepest descent? ie: $D_k = I$ so $d_k = -\nabla f(x_k)$? I've done research on a few academic articles, but none have incorporated Armijo's rule with the momemtum method. Thanks!

Best Answer

Beware that there are several formulations of Nesterov's method (which are not all completely equivalent), the one you are referring to likely comes from the machine community. Please allow me to propose the following (classical) formulation in which the choice of $d_k$ will appear more clearly: $$ \begin{cases} y_k &= x_k + \delta_k(x_k-x_{k-1})\\ x_{k+1} &=y_k - \alpha_k\nabla f(y_k) \end{cases}, $$ where $\delta_k\in[0,1)$ is an hyper-parameter (Nesterov's historical choice is $\delta_k = k/k+3$). Now, Armijo's rule consists in choosing the step-size $\alpha_k$ so as to ensure a sufficient decrease of $f$ between two iterations $x_k$ and $x_{k+1}$. Using your notations, in Armijo's strategy, we are looking for a step-size $\alpha_k$ of the form $s\gamma^m$ where $m\in\mathbb{N}$ is what we have to choose. The update thus rewrites $x_{k+1} = x_k-s\gamma^m\nabla f(y_k)$ and we want, $$f(x_k)-f(x_{k+1})=f(x_k)-f(x_k-s\gamma^m\nabla f(y_k))\geq - \sigma s\gamma^m\nabla f(x_k)^T(-\nabla f(y_k)).$$

So in short, $d_k=-\nabla f(y_k)$ is the direction in which we update $x_k$.

The reason why Armijo's rule is not so popular for Nesterov's method might be that it requires one to evaluate $\nabla f(x_k)$ (in addition to $\nabla f(y_k))$.

Related Question