Gradient Descent: Minimising the Directional Derivative in Direction $\mathbf{u}$

Tags: gradient descent, multivariable calculus, optimization, partial derivative, vector analysis

The following excerpt is from section 4.3 of Deep Learning, by Goodfellow, Bengio, and Courville:

[Excerpt image: the book's derivation showing that minimizing the directional derivative, $\min\limits_{\mathbf{u},\, \mathbf{u}^T \mathbf{u} = 1} \mathbf{u}^T \nabla_{\mathbf{x}} f(\mathbf{x})$, leads to the method of steepest descent.]

I don't understand the following:

  1. What is meant by $\mathbf{u},\, \mathbf{u}^T \mathbf{u} = 1$ in $\min\limits_{\mathbf{u},\, \mathbf{u}^T \mathbf{u} = 1}$?

  2. Why is $\min\limits_{\mathbf{u},\, \mathbf{u}^T \mathbf{u} = 1} \mathbf{u}^T \nabla_{\mathbf{x}} f(\mathbf{x}) = \min\limits_{\mathbf{u},\, \mathbf{u}^T \mathbf{u} = 1} \|\mathbf{u}\|_2 \, \|\nabla_{\mathbf{x}} f(\mathbf{x})\|_2 \cos(\theta)$? I have no idea how the components of the latter expression came about.

  3. The authors state that the factors that do not depend on $\mathbf{u}$ are ignored, and that the expression then simplifies to $\min_{\mathbf{u}} \cos(\theta)$; but doesn't $\cos(\theta)$ depend on $\theta$, not on $\mathbf{u}$?

  4. I'm not sure I understand the explanation that immediately follows this, but that may be because I don't understand the preceding material.

I would greatly appreciate it if people could please take the time to clarify this.

Best Answer

  1. $\mathbf{u}$ is a unit vector pointing in the direction along which you want to evaluate the slope. That is why the constraint $\mathbf{u}^T \mathbf{u} = \|\mathbf{u}\|_2^2 = 1$ appears under the minimization: it restricts the search to unit vectors.
  2. The expression in your second question is just the dot product between $\mathbf{u}$ and the gradient vector $\nabla_{\mathbf{x}} f(\mathbf{x})$, and for any two vectors the dot product equals the product of their lengths times the cosine of the angle $\theta$ between them: $\mathbf{a}^T \mathbf{b} = \|\mathbf{a}\|_2 \, \|\mathbf{b}\|_2 \cos(\theta)$.
  3. The two magnitudes can be ignored because they do not depend on the direction: $\|\mathbf{u}\|_2 = 1$ by the constraint, and $\|\nabla_{\mathbf{x}} f(\mathbf{x})\|_2$ is a fixed number once $\mathbf{x}$ is fixed. And $\cos(\theta)$ really does depend on $\mathbf{u}$: $\theta$ is the angle between $\mathbf{u}$ and the fixed gradient, so choosing $\mathbf{u}$ determines $\theta$. The minimum $\cos(\theta) = -1$ is attained when $\mathbf{u}$ points exactly opposite the gradient, which is why gradient descent steps along $-\nabla_{\mathbf{x}} f(\mathbf{x})$ (the method of steepest descent). A short numerical check follows below.
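
If it helps to see this numerically, here is a minimal NumPy sketch; the quadratic $f$, the evaluation point, and the test direction are assumed examples chosen for illustration, not taken from the book. It checks that the slope of $f$ along a unit vector $\mathbf{u}$ equals $\mathbf{u}^T \nabla_{\mathbf{x}} f(\mathbf{x})$, and that the unit vector minimizing that slope is the negative normalized gradient, where $\cos(\theta) = -1$.

```python
import numpy as np

# Assumed example function (not from the book): f(x) = x_1^2 + 3*x_2^2,
# with gradient [2*x_1, 6*x_2].
def f(x):
    return x[0] ** 2 + 3.0 * x[1] ** 2

def grad_f(x):
    return np.array([2.0 * x[0], 6.0 * x[1]])

x = np.array([1.0, 2.0])
g = grad_f(x)

# The slope of f along a unit vector u equals u^T g
# (checked here with a finite difference).
u = np.array([1.0, 1.0]) / np.sqrt(2.0)
eps = 1e-6
numeric_slope = (f(x + eps * u) - f(x)) / eps
print(np.isclose(numeric_slope, u @ g, atol=1e-4))   # True

# u^T g factors as ||u||_2 * ||g||_2 * cos(theta), so with ||u||_2 = 1
# it is minimized when cos(theta) = -1, i.e. when u points opposite
# the gradient.
u_star = -g / np.linalg.norm(g)
print(np.isclose(u_star @ g, -np.linalg.norm(g)))    # True

# Brute-force check over many unit vectors: no direction has a steeper
# downhill slope than the negative gradient direction.
angles = np.linspace(0.0, 2.0 * np.pi, 1000)
candidates = np.stack([np.cos(angles), np.sin(angles)], axis=1)
print(np.all(candidates @ g >= u_star @ g - 1e-9))   # True
```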