Gradient/Steepest Descent: Solving for a Step Size That Makes the Directional Derivative Vanish

Tags: gradient-descent, machine-learning, real-analysis, vector-analysis

The following excerpt is from section 4.3 of Deep Learning, by Goodfellow, Bengio, and Courville:

[Excerpt image not reproduced; the quoted passage states that we can sometimes solve for the step size that makes the directional derivative vanish.]

The authors state that sometimes we can solve for the step size that makes the directional derivative vanish. If I'm not mistaken, for a function to "vanish" means that it equals $0$ at some point. But in order to have $\nabla_\mathbf{x} f(\mathbf{x}) = 0$, we would, obviously, just have to find the point(s) at which the directional derivative $\nabla_\mathbf{x} f(\mathbf{x}) = 0$ (a critical point); the step size (the value of $\epsilon$) cannot influence whether $\nabla_\mathbf{x} f(\mathbf{x}) = 0$, since it is simply a scalar multiple. Given this fact, I don't understand what is meant by "solve for a step size that makes the directional derivative vanish"?

I'm sure that I'm misunderstanding something here, so I would appreciate it if people could please take the time to clarify this.

Best Answer

First, you're right, "to vanish" means "to be(come) zero".

You seem to be confusing the gradient and the directional derivative.

The gradient $\nabla_\mathbf xf(\mathbf x)$ is a property of the function $f$ at the point $\mathbf x$. (The argument $\mathbf x$ in parentheses specifies the point $\mathbf x$ at which the gradient is taken, whereas the subscript $\mathbf x$ on the nabla operator specifies the variable $\mathbf x$ with respect to which the gradient is taken.)

The directional derivative $\frac{\partial f(\mathbf x)}{\partial \mathbf n}$ is the derivative of the function $f(\mathbf x)$ along the direction specified by a unit vector $\mathbf n$. It's defined by

$$ \frac{\partial f(\mathbf x)}{\partial \mathbf n}=\lim_{\epsilon\to0}\frac{f(\mathbf x+\epsilon\mathbf n)-f(\mathbf x)}\epsilon\;. $$
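For concreteness, take the (arbitrarily chosen, not from the book) function $f(\mathbf x)=\mathbf x\cdot\mathbf x$. Then

$$f(\mathbf x+\epsilon\mathbf n)-f(\mathbf x)=2\epsilon\,\mathbf n\cdot\mathbf x+\epsilon^2\;,$$

since $\mathbf n$ is a unit vector, so dividing by $\epsilon$ and letting $\epsilon\to0$ gives $\frac{\partial f(\mathbf x)}{\partial \mathbf n}=2\,\mathbf n\cdot\mathbf x$.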

The connection between the two is that (under suitable differentiability conditions)

$$\frac{\partial f(\mathbf x)}{\partial \mathbf n}=\mathbf n\cdot\nabla_\mathbf xf(\mathbf x)\;.$$
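If a numerical check helps, here is a minimal NumPy sketch (the function $f$, the point $\mathbf x$ and the direction $\mathbf n$ are arbitrary choices for illustration) comparing the finite-difference quotient from the definition with $\mathbf n\cdot\nabla_\mathbf xf(\mathbf x)$:

```python
import numpy as np

# Arbitrary example function with a known gradient:
# f(x) = sin(x[0]) + x[1]**2,  grad f(x) = (cos(x[0]), 2*x[1])
def f(x):
    return np.sin(x[0]) + x[1] ** 2

def grad_f(x):
    return np.array([np.cos(x[0]), 2.0 * x[1]])

x = np.array([0.3, -1.2])      # point at which we differentiate
n = np.array([1.0, 2.0])
n = n / np.linalg.norm(n)      # unit direction vector

eps = 1e-6
fd = (f(x + eps * n) - f(x)) / eps   # finite-difference quotient
dd = n @ grad_f(x)                   # n . grad f(x)

print(fd, dd)   # the two numbers agree to about 6 decimal places
```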

Since the directional derivative is the scalar product of the direction vector and the gradient, it is greatest in the direction of the gradient: by the Cauchy–Schwarz inequality, $\mathbf n\cdot\nabla_\mathbf xf(\mathbf x)\le\|\mathbf n\|\,\|\nabla_\mathbf xf(\mathbf x)\|=\|\nabla_\mathbf xf(\mathbf x)\|$ for any unit vector $\mathbf n$, with equality exactly when $\mathbf n$ points along the gradient. With the unit vector $\mathbf g=\frac{\nabla_\mathbf xf(\mathbf x)}{\|\nabla_\mathbf xf(\mathbf x)\|}$, we have

$$\frac{\partial f(\mathbf x)}{\partial \mathbf g}=\mathbf g\cdot\nabla_\mathbf xf(\mathbf x)=\frac{\nabla_\mathbf xf(\mathbf x)\cdot\nabla_\mathbf xf(\mathbf x)}{\|\nabla_\mathbf xf(\mathbf x)\|}=\|\nabla_\mathbf xf(\mathbf x)\|\;.$$

The text you quote isn't saying that you can choose the step size so as to make the gradient vanish, but rather so as to make the directional derivative vanish. It doesn't say which directional derivative, but it's implied that it's the directional derivative along the search direction.

Think of this as setting off in some direction in the mountains and walking downhill straight ahead until you reach a point where you'd start going up again if you continued in that direction – the bottom of the valley, if you will. This isn't the lowest point yet, since you may be able to descend further by changing direction, but you've optimized the height with respect to the direction that you set off in, and at the point you've reached, the path in that direction is horizontal, even though changing direction could still take you downhill.
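To make "solve for the step size" concrete: with search direction $-\nabla_\mathbf xf(\mathbf x)$, the new point is $\mathbf x-\epsilon\nabla_\mathbf xf(\mathbf x)$, and one looks for the $\epsilon$ at which

$$\frac{\mathrm d}{\mathrm d\epsilon}f\bigl(\mathbf x-\epsilon\nabla_\mathbf xf(\mathbf x)\bigr)=-\nabla_\mathbf xf(\mathbf x)\cdot\nabla_\mathbf xf\bigl(\mathbf x-\epsilon\nabla_\mathbf xf(\mathbf x)\bigr)=0\;,$$

i.e. the directional derivative at the new point, taken along the search direction, is zero. For a quadratic $f(\mathbf x)=\frac12\mathbf x^\top A\mathbf x-\mathbf b^\top\mathbf x$ with $A$ symmetric positive definite (a toy example of my own, not the book's), the gradient is $\mathbf g=A\mathbf x-\mathbf b$ and the equation has the closed-form solution $\epsilon^*=\frac{\mathbf g^\top\mathbf g}{\mathbf g^\top A\mathbf g}$. Here is a short NumPy sketch of gradient descent with this exact line search:

```python
import numpy as np

# Toy quadratic f(x) = 0.5 x^T A x - b^T x with A symmetric positive
# definite (arbitrary example values, not taken from the book).
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([1.0, -1.0])

def grad(x):
    return A @ x - b

x = np.array([5.0, 5.0])
for _ in range(10):
    g = grad(x)
    # Exact line search: the eps that makes d/d(eps) f(x - eps*g) vanish.
    eps = (g @ g) / (g @ (A @ g))
    x_new = x - eps * g
    # At x_new, the directional derivative along the search direction is
    # zero: the new gradient is orthogonal to the old one.
    assert abs(g @ grad(x_new)) < 1e-8
    x = x_new

print(x)                      # close to the true minimizer ...
print(np.linalg.solve(A, b))  # ... which is A^{-1} b
```

Each step ends exactly at the "bottom of the valley" along its own direction; the next step then sets off at a right angle to the previous one, which is the zig-zag pattern characteristic of steepest descent.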