Biggest Step Size for Gradient Descent with Lipschitz Continuous Gradient – Convex Optimization

convex optimization, convex-analysis, gradient descent, linear algebra

Given a convex function $ f : \mathbb{R}^{n} \to \mathbb{R} $ with an $ L $-Lipschitz continuous gradient, namely:

$$ {\left\| \nabla f \left( x \right) - \nabla f \left( y \right) \right\|}_{2} \leq L {\left\| x - y \right\|}_{2} $$

What is the largest constant step size, $ \alpha $, one could use in Gradient Descent to minimize the function?
In most of the literature I see $ \alpha = \frac{1}{L} $, yet in other cases I see $ \alpha = \frac{2}{L} $. Which one is correct?

Also, for the case $ f \left( x \right) = \frac{1}{2} {\left\| A x - b \right\|}_{2}^{2} $, what is $ L $? Is it the largest singular value of $ A $?

Best Answer

Let $y$ be the result of one step of gradient descent from $x$, i.e., $y = x-\alpha \nabla f(x)$. Since $f$ has an $L$-Lipschitz continuous gradient, the descent lemma gives \begin{equation} \begin{aligned} f(y) - f(x) & \leq \langle \nabla f(x), y-x \rangle + \frac{L}{2}\|y-x\|^2 \\ &= -\alpha \|\nabla f(x)\|^2 + \frac{L}{2}\alpha^2\|\nabla f(x)\|^2. \end{aligned} \end{equation} To guarantee sufficient decrease, we need \begin{equation} \frac{L}{2}\alpha^2 - \alpha \leq 0 \Rightarrow \alpha \leq \frac{2}{L}. \end{equation} In addition, minimizing the right-hand side $\left(\frac{L}{2}\alpha^2 - \alpha\right)\|\nabla f(x)\|^2$ over $\alpha$ shows that the optimal step size is $\alpha = \frac{1}{L}$. So, if I remember correctly, the step-size range for proving convergence is $0 < \alpha < \frac{2}{L}$ (cf. Boris T. Polyak, Introduction to Optimization, page 21), and the optimal choice is $\alpha = \frac{1}{L}$ (cf. Yurii Nesterov, Introductory Lectures on Convex Programming, page 29).
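To see the $\frac{2}{L}$ threshold concretely, here is a minimal numerical sketch (assuming NumPy; the matrix $A$, vector $b$, and helper functions are made up for illustration) using the quadratic $f(x) = \frac{1}{2}\|Ax - b\|_2^2$ from the question. A step size just below $\frac{2}{L}$ still converges, while one just above it diverges:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 20))
b = rng.standard_normal(50)

# Lipschitz constant of the gradient: L = ||A^T A||_2 = sigma_max(A)^2
L = np.linalg.norm(A, 2) ** 2

f = lambda x: 0.5 * np.linalg.norm(A @ x - b) ** 2
grad = lambda x: A.T @ (A @ x - b)

def run_gd(alpha, iters=200):
    """Run gradient descent with a fixed step size and return the final f-value."""
    x = np.zeros(A.shape[1])
    for _ in range(iters):
        x = x - alpha * grad(x)
    return f(x)

print(run_gd(1.0 / L))  # optimal fixed step: converges
print(run_gd(1.9 / L))  # just below 2/L: still converges
print(run_gd(2.1 / L))  # just above 2/L: blows up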

However, even though gradient descent converges for any $0 < \alpha < \frac{2}{L}$, the $O(1/k)$ convergence rate is typically only guaranteed for $0 < \alpha \leq \frac{1}{L}$ (cf. Gradient Descent: Convergence Analysis, Theorem 6.1).
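That theorem bounds the suboptimality by $f(x_k) - f^\star \leq \frac{\|x_0 - x^\star\|^2}{2 \alpha k}$ for $\alpha \leq \frac{1}{L}$, which is easy to check numerically. Continuing the sketch above (same assumed `A`, `b`, `f`, `grad`, `L`):

```python
# Verify f(x_k) - f* <= ||x0 - x*||^2 / (2 * alpha * k) for alpha = 1/L
x_star, *_ = np.linalg.lstsq(A, b, rcond=None)  # minimizer of f
f_star = f(x_star)

alpha = 1.0 / L
x = np.zeros(A.shape[1])
for k in range(1, 101):
    x = x - alpha * grad(x)
    bound = np.linalg.norm(x_star) ** 2 / (2 * alpha * k)  # x0 = 0
    assert f(x) - f_star <= bound + 1e-9
```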

For $f(x) = \frac{1}{2}\|Ax-b\|^2$ we have $\nabla f(x) = A^T(Ax - b)$, so \begin{equation} \|\nabla f(x) - \nabla f(y)\| = \|A^TA(x-y)\| \leq \|A^TA\| \|x-y\|. \end{equation} The inequality uses the operator norm, so $L = \|A^TA\|_2$, which is the largest eigenvalue of $A^TA$, i.e., the square of the largest singular value of $A$ (not the singular value itself).
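A one-line numerical check (a sketch assuming NumPy, reusing the `A` from above) makes the distinction between $\sigma_{\max}(A)$ and $\sigma_{\max}(A)^2$ explicit:

```python
# L is the largest eigenvalue of A^T A, i.e., the SQUARE of sigma_max(A)
sigma_max = np.linalg.norm(A, 2)             # largest singular value of A
lam_max = np.linalg.eigvalsh(A.T @ A).max()  # largest eigenvalue of A^T A
assert np.isclose(lam_max, sigma_max ** 2)
```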
