Steepest descent for a quadratic function

Tags: gradient-descent, optimization, quadratic-forms

Let us focus on the special case of a convex quadratic function
$$
f(\textbf{x}) = \frac{1}{2}\textbf{x}^\top A\textbf{x} + \textbf{b}^\top \textbf{x}
$$

where $A$ is a symmetric positive definite matrix and $\textbf{b} \in \mathbb{R}^d$. Denote the gradient at the current iterate $\textbf{x}^{(k)}$ by
$$
\textbf{g}_k := \nabla f(\textbf{x}^{(k)}) = A\textbf{x}^{(k)}+\textbf{b}
$$

In this case the optimal step size in direction $-\textbf{g}_k$ can be obtained analytically by solving
\begin{align*}
0&= \frac{d}{d\lambda}f(\textbf{x}^{(k)}-\lambda \textbf{g}_k) \qquad (1)\\
& = \langle A(\textbf{x}^{(k)}-\lambda \textbf{g}_k) + \textbf{b}, -\textbf{g}_k \rangle \qquad (2)\\
&=-\|\textbf{g}_k\|^2 + \lambda \langle A\textbf{g}_k, \textbf{g}_k \rangle \qquad (3)
\end{align*}

so that finally
$$
\boxed{\lambda = \frac{\|\textbf{g}_k\|^2}{\textbf{g}_k^\top A\textbf{g}_k}}
$$
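As a numerical sanity check, here is a minimal numpy sketch (the matrix $A$, vector $\textbf{b}$, and iterate are made-up test data) confirming that the directional derivative $\frac{d}{d\lambda}f(\textbf{x}^{(k)}-\lambda\textbf{g}_k)$ indeed vanishes at this step size:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical test problem: A = M M^T + I is symmetric positive definite.
M = rng.standard_normal((4, 4))
A = M @ M.T + np.eye(4)
b = rng.standard_normal(4)

f = lambda v: 0.5 * v @ A @ v + b @ v   # f(x) = 1/2 x^T A x + b^T x
x = rng.standard_normal(4)              # current iterate x^(k)
g = A @ x + b                           # gradient g_k

lam = (g @ g) / (g @ A @ g)             # the boxed optimal step size

# phi(t) = f(x - t g) should be minimized at t = lam; a central finite
# difference of phi at lam should therefore be (numerically) zero.
eps = 1e-6
phi = lambda t: f(x - t * g)
print((phi(lam + eps) - phi(lam - eps)) / (2 * eps))  # ~ 0
```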

Given $\textbf{x}^{(0)}$, the Steepest Descent algorithm is then completely defined by
$$
\boxed{\textbf{x}^{(k+1)} = \textbf{x}^{(k)}- \frac{\|\textbf{g}_k\|^2}{\textbf{g}_k^\top A\textbf{g}_k}\textbf{g}_k}
$$
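For completeness, a short numpy sketch of the resulting iteration (the SPD test matrix, tolerance, and iteration cap here are arbitrary choices, not part of the derivation):

```python
import numpy as np

def steepest_descent(A, b, x0, tol=1e-10, max_iter=1000):
    """Minimize f(x) = 1/2 x^T A x + b^T x for symmetric positive
    definite A, using the exact line search derived above."""
    x = x0.copy()
    for _ in range(max_iter):
        g = A @ x + b                    # g_k = A x^(k) + b
        if np.linalg.norm(g) < tol:      # stop once the gradient vanishes
            break
        lam = (g @ g) / (g @ A @ g)      # optimal step size
        x = x - lam * g                  # x^(k+1) = x^(k) - lam g_k
    return x

# The minimizer satisfies A x* + b = 0, so compare against a direct solve.
rng = np.random.default_rng(1)
M = rng.standard_normal((5, 5))
A = M @ M.T + np.eye(5)                  # made-up SPD test matrix
b = rng.standard_normal(5)
print(np.allclose(steepest_descent(A, b, np.zeros(5)),
                  np.linalg.solve(A, -b)))            # True
```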

My questions

  • To go from (1) to (2), I tried plugging the definition of $\textbf{g}_k$ into (1) and got

$$0 = \frac{d}{d\lambda}f(\textbf{x}^{(k)}-\lambda( A\textbf{x}^{(k)}+\textbf{b}))$$

  • I also had trouble going from (2) to (3). Here is my attempt:

\begin{align*}
0 &= \langle A(\textbf{x}^{(k)}-\lambda \textbf{g}_k) + \textbf{b}, -\textbf{g}_k \rangle\\
&= (A\textbf{x}^{(k)})^\top\textbf{g}_k + \lambda \textbf{g}_k^\top\textbf{g}_k + \textbf{b}^\top\textbf{g}_k \\
\end{align*}

Edit
Here is the full derivation:

\begin{align*}
0 &= \frac{d}{d\lambda}f(\textbf{x}^{(k)}-\lambda \textbf{g}_k)\\
&= \frac{d}{d\lambda}f(\underbrace{\textbf{x}^{(k)} - \lambda(A\textbf{x}^{(k)} + \textbf{b})}_{\textbf{y}(\lambda)})\\
&\text{hence } \frac{d\textbf{y}}{d\lambda} = -(A\textbf{x}^{(k)} + \textbf{b}) = -\textbf{g}_k\\
&\text{and } \nabla f(\textbf{y}) = A\textbf{y} + \textbf{b}\\
&\text{so } \frac{df}{d\lambda} = \left\langle \nabla f(\textbf{y}), \frac{d\textbf{y}}{d\lambda} \right\rangle = \langle A\textbf{y} + \textbf{b}, -\textbf{g}_k\rangle \\
\text{plugging in the definition of } \textbf{y}: \quad &= \langle A(\textbf{x}^{(k)} - \lambda \underbrace{(A\textbf{x}^{(k)} + \textbf{b})}_{\textbf{g}_k}) + \textbf{b}, -\textbf{g}_k\rangle \\
& = \langle A(\textbf{x}^{(k)}-\lambda \textbf{g}_k) + \textbf{b}, -\textbf{g}_k \rangle \\
\text{rearranging: } \quad & = \langle \underbrace{A\textbf{x}^{(k)} + \textbf{b}}_{\textbf{g}_k} -\lambda A \textbf{g}_k , -\textbf{g}_k \rangle \\
&= -\|\textbf{g}_k\|^2 + \lambda \langle A\textbf{g}_k, \textbf{g}_k \rangle
\end{align*}

Best Answer

Use the chain rule. In 1D $$\frac{df(y(\lambda))}{d\lambda}=\frac{df(y)}{dy}\frac{dy(\lambda)}{d\lambda}$$ In $d$-dimensional space you get $$\frac{df({\bf y}(\lambda))}{d\lambda}=\sum_i(\nabla f(\mathbf y))_i\frac{dy_i(\lambda)}{d\lambda}$$ You can write this last expression as the inner product $\left\langle \nabla f(\mathbf y), \frac{d\mathbf y}{d\lambda}\right\rangle$.

In your case $$\mathbf y(\lambda)=\textbf{x}^{(k)}-\lambda( A\textbf{x}^{(k)}+\textbf{b})$$ Then $$\frac{d\mathbf y}{d\lambda}=-(A\textbf{x}^{(k)}+\textbf{b})=-\mathbf g_k$$ and $$\nabla f(\mathbf y)=A\mathbf y+\mathbf b$$ Just put them together and you get (2).
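A quick finite-difference check of this identity, with hypothetical $A$, $\textbf{b}$, $\textbf{x}^{(k)}$, and an arbitrary test value of $\lambda$:

```python
import numpy as np

rng = np.random.default_rng(2)
M = rng.standard_normal((3, 3))
A = M @ M.T + np.eye(3)                  # made-up SPD matrix
b = rng.standard_normal(3)
x = rng.standard_normal(3)               # stands in for x^(k)

f = lambda v: 0.5 * v @ A @ v + b @ v
g = A @ x + b                            # g_k
y = lambda lam: x - lam * g              # y(lambda)

lam0, eps = 0.3, 1e-6
# Left side: d f(y(lambda)) / d lambda via a central finite difference.
lhs = (f(y(lam0 + eps)) - f(y(lam0 - eps))) / (2 * eps)
# Right side: <grad f(y), dy/dlambda> = <A y + b, -g_k>.
rhs = (A @ y(lam0) + b) @ (-g)
print(np.isclose(lhs, rhs))              # True
```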

For (3) you have some small errors in your last line: the first and last terms should each carry a minus sign, and the middle term is missing the matrix $A$ (it should read $\lambda \textbf{g}_k^\top A\textbf{g}_k$). Grouping the first and last terms together gives $-(A\textbf{x}^{(k)}+\textbf{b})^\top\textbf{g}_k = -\textbf{g}_k^\top\textbf{g}_k = -\|\textbf{g}_k\|^2$, which yields the answer.
