Derivation of the steepest direction

gradient descent · inequality · optimization

The basic idea behind the derivation of the steepest direction is to find $\textbf{d}$ such that $f(\textbf{x}^{(k)} + \lambda \textbf{d})$ decreases the most for small values of $\lambda$.

In other words, we want to find the $\textbf{d} \in \mathbb{S}^{n-1}$ that minimizes
$$
\frac{d}{d\lambda}f(\textbf{x}^{(k)} + \lambda \textbf{d})\Big|_{\lambda = 0} = \langle \nabla f(\textbf{x}^{(k)}), \textbf{d}\rangle \qquad (1)
$$
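
(For completeness, the equality in (1) is just the chain rule applied along the ray $\lambda \mapsto \textbf{x}^{(k)} + \lambda \textbf{d}$, then evaluated at $\lambda = 0$:)
$$
\frac{d}{d\lambda}f(\textbf{x}^{(k)} + \lambda \textbf{d}) = \langle \nabla f(\textbf{x}^{(k)} + \lambda \textbf{d}), \textbf{d} \rangle
$$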

Now using the Cauchy-Schwarz Inequality, we get
$$
|\langle\nabla f(\textbf{x}^{(k)}), \textbf{d} \rangle|^2 \leq \|\nabla f(\textbf{x}^{(k)})\|^2 \qquad (2)
$$

with equality exactly when $\textbf{d}$ lies on the line spanned by $\nabla f(\textbf{x}^{(k)})$, so that
$$
\textbf{d} = -\frac{1}{\|\nabla f(\textbf{x}^{(k)})\|}\nabla f(\textbf{x}^{(k)}) \qquad (3)
$$

delivers the steepest descent direction.

Here are my questions:

  • The Cauchy-Schwarz Inequality states that
    $$
    \boxed{| \textbf{u}^\top \textbf{v} | \leq \| \textbf{u}\| \|\textbf{v}\|}
    $$

    How can I get (2) with it?

  • Now tackling the way we get to (3) we can start from (2)
    \begin{align*}
|\langle\nabla f(\textbf{x}^{(k)}), \textbf{d} \rangle|^2 &\leq \|\nabla f(\textbf{x}^{(k)})\|^2
\\
(\nabla f(\textbf{x}^{(k)})^\top\textbf{d})^2 &\leq \|\nabla f(\textbf{x}^{(k)})\|^2\\
    \end{align*}

But then again I'm stuck.

EDIT
So by setting $\textbf{u} = \nabla f(\textbf{x}^{(k)})$ and $\textbf{v} = \textbf{d}$ with $\|\textbf{d}\| = 1$ (since $\textbf{d} \in \mathbb{S}^{n-1}$), we can plug these two values into the CS inequality, which gives

\begin{align*}
|\langle \nabla f(\textbf{x}^{(k)}), \textbf{d} \rangle| &\leq \|\nabla f(\textbf{x}^{(k)})\| \|\textbf{d}\|\\
|\langle \nabla f(\textbf{x}^{(k)}), \textbf{d} \rangle| &\leq \|\nabla f(\textbf{x}^{(k)})\| \cdot 1\\
|\langle \nabla f(\textbf{x}^{(k)}), \textbf{d} \rangle|^2 &\leq \|\nabla f(\textbf{x}^{(k)})\|^2\\
\end{align*}

Now, setting
$$
\textbf{d} = -\frac{\nabla f(\textbf{x}^{(k)}) }{\|\nabla f(\textbf{x}^{(k)})\|}
$$

We get
\begin{align*}
|\langle \nabla f(\textbf{x}^{(k)}), -\frac{\nabla f(\textbf{x}^{(k)}) }{\|\nabla f(\textbf{x}^{(k)})\|} \rangle|^2 &\leq \|\nabla f(\textbf{x}^{(k)})\|^2\\
\frac{\|\nabla f(\textbf{x}^{(k)})\|^4}{\|\nabla f(\textbf{x}^{(k)})\|^2} &\leq \|\nabla f(\textbf{x}^{(k)})\|^2\\
\|\nabla f(\textbf{x}^{(k)})\|^2 &= \|\nabla f(\textbf{x}^{(k)})\|^2\\
\end{align*}
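
(As a quick numerical sanity check, not part of the derivation itself: the sketch below uses a hypothetical quadratic $f$ and numpy to confirm that the slope along $\textbf{d} = -\nabla f(\textbf{x}^{(k)})/\|\nabla f(\textbf{x}^{(k)})\|$ equals $-\|\nabla f(\textbf{x}^{(k)})\|$ and is no larger than the slope along any other unit direction.)

```python
import numpy as np

# Hypothetical test problem: f(x) = 0.5 * x^T A x - b^T x, so grad f(x) = A x - b.
rng = np.random.default_rng(0)
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])

def grad_f(x):
    return A @ x - b

x_k = np.array([2.0, 0.5])
g = grad_f(x_k)

# Steepest descent direction from (3): the normalized negative gradient.
d_star = -g / np.linalg.norm(g)

# Its slope (directional derivative) should be exactly -||grad f(x_k)||.
print(np.isclose(g @ d_star, -np.linalg.norm(g)))  # True

# No random unit direction should give a smaller (more negative) slope.
dirs = rng.normal(size=(1000, 2))
dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
print((dirs @ g).min() >= g @ d_star - 1e-12)  # True
```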

Best Answer

The derivation of (2) starts with the assumption (which the steps you presented did not state explicitly) that $\|\mathbf{d}\| = 1$. With this added information, let $u = \nabla f(\mathbf{x}^{(k)})$ and $v = \mathbf{d}$. Then CS says $$ |u \cdot v|^2 \leq \|u\|^2 \|v\|^2 = \|u\|^2 (1) = \|\nabla f(\mathbf{x}^{(k)})\|^2 $$

From (2) you know that whatever direction is chosen for $\mathbf{d}$, the LHS is no greater than $\|\nabla f(\mathbf{x}^{(k)})\|^2$. Therefore if you can find a direction (a value of $\mathbf{d}$) such that equality holds, this is a direction (perhaps there could be others) maximizing the LHS.

The normalized gradient $\nabla f(\mathbf{x}^{(k)})/\|\nabla f(\mathbf{x}^{(k)})\|$ is one such value of $\mathbf{d}$ that gives equality in (2).

Since the LHS gives the squared magnitude of the slope of the function along $\mathbf{d}$, to minimize the slope itself you need to choose $\mathbf{d}$ proportional to $\nabla f(\mathbf{x}^{(k)})$, but such that

  • The magnitude of $\mathbf{d}$ is $1$, and
  • The slope $\langle \nabla f(\mathbf{x}^{(k)}), \mathbf{d} \rangle$ is negative.

Expression (3) is the unique expression having those properties.
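
(For illustration only, here is a minimal sketch of a single descent step built from expression (3); the objective function and the fixed step size $\lambda$ are assumptions made up for the example, not part of the answer.)

```python
import numpy as np

# Hypothetical smooth objective: f(x) = 0.5 * ||x||^2 + sin(x_0).
def f(x):
    return 0.5 * x @ x + np.sin(x[0])

def grad_f(x):
    g = x.copy()
    g[0] += np.cos(x[0])
    return g

def steepest_descent_step(x_k, lam=0.1):
    """One update x_{k+1} = x_k + lam * d, with d the unit vector from expression (3)."""
    g = grad_f(x_k)
    d = -g / np.linalg.norm(g)  # normalized negative gradient
    return x_k + lam * d

x_k = np.array([1.0, 2.0])
x_next = steepest_descent_step(x_k)
print(f(x_next) < f(x_k))  # True for a sufficiently small step size
```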
