Backtracking Line Search Algorithm – Why make $a$ smaller every time

optimization

I am studying line search methods and I stumbled upon the "backtracking algorithm" for computing a step length $a_k$ that gives a sufficient decrease in our function. To decide whether the decrease is sufficient, we use the first Wolfe condition, which is as follows:

$$f(x_k + a p_k) \leq f(x_k) + c\, a\, \nabla f_k^T p_k, \quad c \in (0,1)$$

The backtracking algorithm says: start with an arbitrary positive $a$. While the above inequality doesn't hold, substitute $a$ with a smaller value, i.e. $$a \leftarrow \rho a$$ where $\rho \in (0,1)$. When we reach a value of $a$ for which the inequality is satisfied, the algorithm stops.
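For concreteness, here is a minimal Python sketch of this loop (the test function and the values $a_0=1$, $\rho=0.5$, $c=10^{-4}$ are my own illustrative choices, not part of the question):

```python
import numpy as np

def backtracking(f, grad_f, x_k, p_k, a0=1.0, rho=0.5, c=1e-4):
    """Shrink a until the sufficient-decrease (first Wolfe) condition holds."""
    a = a0
    fx = f(x_k)
    slope = grad_f(x_k) @ p_k      # directional derivative; < 0 for a descent direction
    while f(x_k + a * p_k) > fx + c * a * slope:
        a *= rho                   # backtrack: a <- rho * a
    return a

# Example: one steepest-descent step on f(x) = x1^2 + 10*x2^2
f = lambda x: x[0]**2 + 10 * x[1]**2
grad_f = lambda x: np.array([2 * x[0], 20 * x[1]])
x = np.array([1.0, 1.0])
p = -grad_f(x)                     # descent direction
a = backtracking(f, grad_f, x, p)
print(a, f(x + a * p))             # accepted step; f has decreased from f(x) = 11
```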

I have a question that is of a more intuitive nature, I think. Why do we make $a$ smaller at every iteration? Why not make it bigger? If we are talking about a simple convex function like $f(x)=x^2$, I think I can justify the decrease of $a$. But consider the following graph:

[Figure: plot of $\phi(a)=f(x_k+ap_k)$ against the line $l(a)$, showing alternating intervals of acceptable and unacceptable values of $a$.]

You see that small values of $a$ are acceptable, then there is an interval in which we can't accept $a$, and then finally another interval of acceptable values. If our first estimate of $a$ falls in the interval of non-acceptable values, can't we just increase its value so it jumps to the next interval of acceptable values?
According to the backtracking algorithm, if we land on a non-acceptable value of $a$, we will make it smaller and smaller until it enters the first acceptable interval.

Is there a reason that we want $a$ to be small and avoid making it bigger? Or is it because this particular graph isn't convex? I'm not sure.

Sorry if my thoughts are a bit jumbled. Thanks in advance.

Note: The picture is taken from "Numerical Optimization", Nocedal & Wright.

Best Answer

If we increase $a$, we do not know whether this will ever result in a valid choice. There is no guarantee that another valley of acceptable values exists farther away from where we are.

But we can be sure that we will find a point below the line $l(a) = f(x_k) + c\,a\,\nabla f_k^T p_k$ if only we make $a$ small enough. This is because $c\in(0,1)$: writing $\phi(a) = f(x_k + a p_k)$, the slope of $\phi$ at $a=0$ is $\nabla f_k^T p_k$, which is more negative than the slope $c\,\nabla f_k^T p_k$ of $l$, so $\phi$ descends more steeply than $l$ at $a=0$.
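As a quick numerical illustration (a one-dimensional nonconvex example of my own, not from the question): $\phi(a)$ falls below the line $l(a)$ for every sufficiently small $a$, while a larger $a$ enjoys no such guarantee.

```python
import numpy as np

# phi(a) = f(x_k + a*p_k) along the search direction; hypothetical nonconvex f
f  = lambda x: np.cos(3 * x) + x**2
df = lambda x: -3 * np.sin(3 * x) + 2 * x
x_k, c = 0.5, 0.5
p_k = -df(x_k)                         # steepest descent, so df(x_k)*p_k < 0

phi = lambda a: f(x_k + a * p_k)       # function value along the ray
l   = lambda a: f(x_k) + c * a * df(x_k) * p_k   # the sufficient-decrease line

for a in [1.0, 0.1, 0.01, 0.001]:
    print(a, phi(a) < l(a))            # False for a = 1.0, True for the smaller a
```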

EDIT: Why can we be sure that we will find a suitable $a$ with $\phi(a)<l(a)$?

Let $g(a)=l(a)-\phi(a).$ Then $g(0)=f(x_k)-f(x_k)=0$ and $g'(0)=(c-1)\,\nabla f_k^T p_k>0,$ because $c-1<0$ and $\nabla f_k^T p_k<0$ (recall that $p_k$ is a descent direction).

We have
$$ g'(0)=\lim_{a\to 0}\frac{g(a)-g(0)}{a-0} = \lim_{a\to 0}\frac{g(a)}{a}. $$
From the definition of the limit: for each $\varepsilon>0$ there is a $\delta>0$ such that
$$ |a-0|<\delta \;\Rightarrow\; \left|\frac{g(a)}{a}-g'(0)\right|<\varepsilon. $$
Now set $\varepsilon=\tfrac12 g'(0)$ (which is positive) and choose $\delta$ accordingly, so that the above implication holds. Then for $0<a<\delta$:
$$ g'(0)-\tfrac 12 g'(0) < \frac{g(a)}{a} < g'(0)+\tfrac 12 g'(0) \;\Rightarrow\; \frac 12 g'(0) < \frac{g(a)}{a} \;\Rightarrow\; g(a) > \frac a2\, g'(0) > 0. $$
So every $a\in(0,\delta)$ satisfies the sufficient decrease condition, and the backtracking loop must terminate once $a$ has been shrunk below $\delta$.
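The bound can also be checked numerically; here is a small self-contained sketch using the same hypothetical one-dimensional example as above:

```python
import numpy as np

# Verify g(a) = l(a) - phi(a) > (a/2) * g'(0) > 0 for small a
f  = lambda x: np.cos(3 * x) + x**2    # hypothetical test function
df = lambda x: -3 * np.sin(3 * x) + 2 * x
x_k, c = 0.5, 0.5
p_k = -df(x_k)

g = lambda a: (f(x_k) + c * a * df(x_k) * p_k) - f(x_k + a * p_k)
g_prime_0 = (c - 1) * df(x_k) * p_k    # = (c-1) * grad_f^T p_k > 0

for a in [1e-1, 1e-2, 1e-3]:
    print(a, g(a) > a / 2 * g_prime_0 > 0)   # True for all three values
```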