[Math] Why does gradient ascent/descent exhibit zig-zag motion

convex optimizationgradient descentnumerical optimizationoptimization

A good way to visualize gradient ascent/descent is to assume you are in a quadratic bowl or on a mountain. If I visualize this, then the direction of steepest ascent/descent is the one that points straight towards the bottom of the bowl or top of the mountain.

With this understanding I have two questions:

  1. If you want to climb the hill or go down a bowl, why take a zig-zag path instead of taking a straight path to the top/bottom?

  2. Why doesn't the steepest path have a unit in the z direction? I understand that gradient is orthogonal to the level sets of the function. That is it lies in the x-y plane orthogonal to the contour. But why doesn't it have a unit in the z direction? With a unit in the z direction, it can point towards the minima/maxima and still be orthogonal to contour lines.

I have a related question: Gradient is NOT the direction that points to the minimum or maximum

Best Answer

Suppose we have a $n \times n$ symmetric, positive definite matrix $\rm Q$. We define the following (convex) quadratic cost function

$$f (\mathrm x) := \frac 12 \mathrm x^\top \mathrm Q \,\mathrm x$$

whose gradient is $\nabla f (\mathrm x) = \mathrm Q \mathrm x$. Using gradient descent with step size $\mu > 0$,

$$\begin{array}{rl} \mathrm x_{k+1} &= \mathrm x_k - \mu \nabla f (\mathrm x_k)\\ &= \mathrm x_k - \mu \mathrm Q \mathrm x_k\\ &= \left( \mathrm I_n - \mu \mathrm Q \right) \mathrm x_k\end{array}$$

Let $\rm Q = V \Lambda V^\top$ be a spectral decomposition of $\rm Q$. Let $\eta_k := \mathrm V^\top \mathrm x_k$. After a bit of work, we obtain

$$\eta_{k+1} = \left( \mathrm I_n - \mu \Lambda \right) \eta_k$$

Note that if

$$0 < \mu \leq \frac{1}{\lambda_{\max} (\mathrm Q)}$$

then all the entries on the main diagonal of $\mathrm I_n - \mu \Lambda$ are in $[0,1)$ and, thus, no zig-zag behavior does occur. However, if

$$\frac{1}{\lambda_{\max} (\mathrm Q)} < \mu \leq \frac{2}{\lambda_{\max} (\mathrm Q)}$$

then at least one of the entries on the main diagonal of $\mathrm I_n - \mu \Lambda$ will be in $[-1,0)$ and, thus, zig-zagging will occur. In short, zig-zagging arises when the step size $\mu$ is not chosen properly.