Backtracking line search: are the bounds on $\alpha$ $(0,1)$ or $(0,1/2)$?


In my optimization class we are using the Boyd and Vandenberghe Convex Optimization textbook, which gives the following stopping criterion for backtracking line search (p. 464):

$$f(x + t \Delta x) \leq f(x) + \alpha t \nabla f(x)^T \Delta x$$

Here $\alpha$ is a parameter on $(0, 1/2)$. These bounds are used when proving the following fact related to gradient descent: If $0 \leq t \leq 1/M$ and $\alpha \lt 1/2$, then

$$f(x - t \nabla f(x)) \leq f(x) - \alpha t \| \nabla f(x) \|^2$$

The proof given is

$$\begin{align}
f(x - t \nabla f(x)) & \leq f(x) - t \| \nabla f(x) \|^2 + \frac{Mt^2}{2} \| \nabla f(x) \|^2 &\text{(Taylor's theorem)}\\
& \leq f(x) - \frac{t}{2} \| \nabla f(x) \|^2 & (\text{because } Mt \leq 1 \text{, so } -t + Mt^2/2 \leq -t/2)\\
& \leq f(x) - \alpha t \| \nabla f(x) \|^2 & (\text{because } \alpha \lt 1/2)
\end{align}$$
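For concreteness, here is a minimal Python sketch of the backtracking loop that the stopping criterion describes (the function name and default parameter values are my own choices for illustration, not taken from the book):

```python
import numpy as np

def backtracking_line_search(f, grad_f, x, delta_x, alpha=0.3, beta=0.8):
    """Shrink t until f(x + t*dx) <= f(x) + alpha * t * grad_f(x)^T dx.

    alpha is the sufficient-decrease parameter discussed above;
    beta in (0, 1) is the shrink factor. Illustrative defaults only.
    """
    t = 1.0
    fx = f(x)
    slope = grad_f(x) @ delta_x  # directional derivative; negative for a descent direction
    while f(x + t * delta_x) > fx + alpha * t * slope:
        t *= beta
    return t

# Example: one gradient-descent step on f(x) = ||x||^2 / 2
f = lambda x: 0.5 * (x @ x)
grad_f = lambda x: x
x0 = np.array([1.0, -2.0])
t = backtracking_line_search(f, grad_f, x0, -grad_f(x0))
x1 = x0 - t * grad_f(x0)
```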

But considering the geometry of the method, it seems to me that it should be "safe" to pick any $\alpha \lt 1$, and this more general bound is mentioned on Wikipedia and this wiki.

Beyond the simplicity of the proof above, is there a deeper reason for restricting $\alpha$ to $(0, 1/2)$?

Best Answer

First, for gradient descent (or steepest descent in general), $\alpha \in (0,1)$ is enough. It is easy to adapt the proof presented in the book to this case. I show such a proof below for gradient descent; the argument for steepest descent is similar.

We will show that for $0\le t\le \frac{2(1-\alpha)}{M}$ and $\alpha\in (0,1)$ we have
$$f(x - t \nabla f(x)) \leq f(x) - \alpha t \| \nabla f(x) \|^2.$$
Indeed,
$$\begin{align}
f(x - t \nabla f(x)) & \leq f(x) - t \| \nabla f(x) \|^2 + \frac{Mt^2}{2} \| \nabla f(x) \|^2\\
& = f(x) - \left(1 - \frac{Mt}{2} \right)t \|\nabla f(x) \|^2 \\
& \leq f(x) - \alpha t \| \nabla f(x) \|^2,
\end{align}$$
where the last inequality holds because $t \le 2(1-\alpha)/M$ implies $1 - Mt/2 \ge \alpha$. Now the proof proceeds as presented in the book.

Update for clarification: you should replace the corresponding values in the remaining part of the proof. For example, instead of "Therefore the backtracking line search terminates either with $t = 1$ or with a value $t\ge \beta/M$", it should now read "Therefore the backtracking line search terminates either with $t = 1$ or with a value $t\ge 2(1-\alpha)\beta/M$". And so on.
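As a quick sanity check (my addition, not part of the argument above), the adapted bound is tight exactly at $t = 2(1-\alpha)/M$. Here it is verified numerically on a one-dimensional quadratic whose gradient is $M$-Lipschitz, assuming NumPy:

```python
import numpy as np

# f(x) = 0.5 * M * x^2 has an M-Lipschitz gradient. For every alpha in (0, 1)
# and every t in [0, 2*(1 - alpha)/M], the sufficient-decrease inequality
# f(x - t*f'(x)) <= f(x) - alpha * t * f'(x)^2 should hold, with equality at t_max.
M = 4.0
f = lambda x: 0.5 * M * x**2
grad = lambda x: M * x

x = 1.7
for alpha in (0.1, 0.5, 0.9, 0.99):
    t_max = 2 * (1 - alpha) / M
    for t in np.linspace(0, t_max, 50):
        lhs = f(x - t * grad(x))
        rhs = f(x) - alpha * t * grad(x)**2
        assert lhs <= rhs + 1e-12, (alpha, t)
print("bound holds for all tested (alpha, t)")
```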

Second, and more interestingly:

Beyond the simplicity of the proof above, is there a deeper reason for restricting $\alpha$ to $(0,1/2)$?

The reason the authors restricted $\alpha$ to $(0,1/2)$ is, apparently, that they wanted to use the same bounds on $\alpha$ for all the methods presented in that chapter of the book. The tricky part is that the analysis of the damped Newton method requires $\alpha < 1/2$ (pages 490-491, and later on page 505 for self-concordant functions). For the other methods, $(0,1)$ is enough (but the authors took advantage of their assumption $\alpha < 1/2$ to simplify expressions, so their proofs need to be adapted to the $(0,1)$ case).

A better presentation would have been to use $(0,1)$ for the general case and restrict to $(0,1/2)$ only when necessary (or at least to make it clear exactly where $(0,1/2)$ is needed, instead of just saying "The reason for requiring $\alpha$ to be smaller than $0.5$ will become clear later").