[Math] Examples where constant step-size gradient descent fails everywhere

gradient descent, numerical optimization, optimization

Just trying to build an intuition for convergence conditions for gradient descent algorithms. One theorem says that if $f:\mathbb{R}^n\to\mathbb{R}$ has a gradient satisfying the Lipschitz condition $$\|\nabla f(y) - \nabla f(x)\|\leq L\|y - x\|,$$ then for a step size $\eta$ taken small enough ($\eta < \frac{2}{L}$), the steepest descent method given by $$x_{k+1} = x_k - \eta \nabla f(x_k)$$ is guaranteed, for any starting point $x_0$, to converge to a stationary point $x^*$, that is, a point satisfying $\nabla f(x^*)=0$.

See Theorem 4.2 in http://users.ece.utexas.edu/~cmcaram/EE381V_2012F/Lecture_4_Scribe_Notes.final.pdf
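For concreteness, here is a minimal sketch of that constant step-size iteration; the quadratic test function and the numbers are my own illustration, not taken from the notes:

```python
import numpy as np

def gradient_descent(grad, x0, eta, n_steps=100):
    """Constant step-size gradient descent: x_{k+1} = x_k - eta * grad(x_k)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_steps):
        x = x - eta * grad(x)
    return x

# Illustrative problem: f(x) = 0.5 * ||x||^2 has a 1-Lipschitz gradient (L = 1),
# so any constant eta < 2/L = 2 should converge to the stationary point at the origin.
grad = lambda x: x
print(gradient_descent(grad, x0=[3.0, -2.0], eta=0.5))  # approaches [0. 0.]
```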

I am wondering about examples demonstrating how the Lipschitz condition is necessary. Is there an example of a smooth function with minima such that, no matter which non-stationary starting point $x_0$ you choose, gradient descent fails to converge even for arbitrarily small constant step sizes?

Best Answer

Let us consider the one-dimensional example \begin{align} f(x)&=\frac23|x|^{3/2}\qquad \text{(the blue graph)}\\ f'(x)&=\operatorname{sign}(x)\cdot|x|^{1/2}\qquad \text{(the red graph)} \end{align}

[Plot of $f$ (blue) and $f'$ (red).]
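As a quick numerical sanity check (my own addition, not part of the original answer), this gradient has no Lipschitz constant near the origin, so the theorem quoted in the question does not apply:

```python
import numpy as np

f  = lambda x: (2.0 / 3.0) * np.abs(x) ** 1.5
df = lambda x: np.sign(x) * np.abs(x) ** 0.5

# |f'(x) - f'(0)| / |x - 0| = 1 / sqrt(|x|) blows up as x -> 0,
# so no Lipschitz constant L works for f' on any neighbourhood of 0.
for x in (1e-2, 1e-4, 1e-6):
    print(x, abs(df(x) - df(0.0)) / abs(x))
```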

Fix any $\eta>0$ and try to find a starting point $x_*>0$ that gives no convergence, namely, the oscillating sequence $x_*,-x_*,x_*,-x_*,\ldots$. For that we need $$ -x_*=x_*-\eta\sqrt{x_*}\quad\Leftrightarrow\quad 2x_*=\eta\sqrt{x_*} \quad\Leftrightarrow\quad x_*=\left(\frac{\eta}{2}\right)^2. $$
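A short numerical check of this oscillation (the value $\eta=0.1$ is just an arbitrary choice for illustration):

```python
import numpy as np

# f'(x) = sign(x) * sqrt(|x|) for f(x) = (2/3)|x|^(3/2)
df = lambda x: np.sign(x) * np.sqrt(np.abs(x))

eta = 0.1                    # arbitrary step size for illustration
x = x_star = (eta / 2) ** 2  # the starting point derived above
traj = [x]
for _ in range(8):
    x = x - eta * df(x)
    traj.append(x)
print(traj)                  # alternates between +x_star and -x_star, never converging
```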

EDIT: Since the OP seems to be interested in having almost all starting points fail to converge, let us show that this is indeed the case here. Let $x_0\in(0,x_*)$. Then we have $$ 2\sqrt{x_0}\left(\sqrt{x_0}-\frac{\eta}{2}\right)<0\quad\Leftrightarrow\quad x_0-\eta\sqrt{x_0}<-x_0. $$ Moreover, $(\sqrt{x_0}-\eta/2)^2>0$ gives $\eta\sqrt{x_0}-x_0<(\eta/2)^2=x_*$, so the next iterate again lies in $(-x_*,x_*)$ and, by symmetry of $f'$, the same estimates apply to it. It means that the iterates keep outside and never enter the interval $[-x_0,x_0]$, proving the origin (i.e. the optimal point) to be unstable. Since it is the only fixed point of the algorithm, almost all starting points show no convergence behavior (all points except for an at most countable set of points that converge in a finite number of steps).
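The same instability can be observed numerically; here $x_0=0.3\,x_*$ is an arbitrary point in $(0,x_*)$, chosen only for illustration:

```python
import numpy as np

df = lambda x: np.sign(x) * np.sqrt(np.abs(x))

eta = 0.1
x_star = (eta / 2) ** 2
x0 = 0.3 * x_star            # an arbitrary non-stationary point in (0, x_star)

x = x0
for _ in range(50):
    x = x - eta * df(x)
    assert abs(x) > x0       # the iterates never re-enter [-x0, x0]
print(abs(x), x_star)        # |x_k| creeps up towards x_star instead of towards 0
```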

Intuition: since $|\nabla f(x)|\gg |x|$ near the origin (the optimal point), the iteration $x-\eta\nabla f(x)$ takes too large a step, misses the target and goes beyond $-x$, contributing to the oscillating behaviour of the algorithm. This makes the origin an unstable fixed point. To recover convergence one has to damp the oscillations by choosing $\eta_k\to 0$ sufficiently fast (by running a line search, for example).
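To illustrate the remedy, here is a comparison of the constant step with a diminishing schedule $\eta_k=\eta_0/(k+1)$; the schedule and the starting point are my own illustrative choices (the answer only asks for $\eta_k\to 0$, e.g. via a line search):

```python
import numpy as np

df = lambda x: np.sign(x) * np.sqrt(np.abs(x))

eta0 = 0.1
x0 = 0.002                   # a generic starting point inside (0, x_star) for eta0 = 0.1

# Constant step: the iterates get trapped oscillating at magnitude ~ (eta0/2)^2 = 0.0025.
x = x0
for k in range(10_000):
    x = x - eta0 * df(x)
print("constant step   :", abs(x))

# Diminishing steps eta_k = eta0/(k+1): the oscillations are damped and |x_k| -> 0.
x = x0
for k in range(10_000):
    x = x - (eta0 / (k + 1)) * df(x)
print("diminishing step:", abs(x))
```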