Taking the gradient of $f(\mathbf{x}) = \frac{1}{2}\|\mathbf{A} \mathbf{x} - \mathbf{b}\|_2^2$

derivatives, linear-algebra, matrices, matrix-calculus, multivariable-calculus

Section 4.5 of the textbook Deep Learning by Goodfellow, Bengio, and Courville says that the gradient of

$$f(\mathbf{x}) = \dfrac{1}{2}\|\mathbf{A} \mathbf{x} - \mathbf{b}\|_2^2$$

is

$$\nabla_{\mathbf{x}} f(\mathbf{x}) = \mathbf{A}^T (\mathbf{A}\mathbf{x} - \mathbf{b}) = \mathbf{A}^T \mathbf{A} \mathbf{x} - \mathbf{A}^T \mathbf{b}$$
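(As a quick sanity check of this formula, here is a minimal numerical sketch using numpy, comparing $\mathbf{A}^T(\mathbf{A}\mathbf{x} - \mathbf{b})$ against central finite differences of $f$; the shapes, seed, and variable names are arbitrary.)

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3))  # arbitrary 5x3 example matrix
b = rng.standard_normal(5)
x = rng.standard_normal(3)

def f(x):
    r = A @ x - b                 # residual Ax - b
    return 0.5 * (r @ r)          # (1/2) * ||Ax - b||_2^2

grad_formula = A.T @ (A @ x - b)  # the textbook gradient A^T (Ax - b)

# Independent check: central finite differences, one coordinate at a time.
eps = 1e-6
grad_fd = np.array([(f(x + eps * e) - f(x - eps * e)) / (2 * eps)
                    for e in np.eye(3)])

print(np.allclose(grad_formula, grad_fd, atol=1e-6))  # prints True
```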

My understanding is that $f(\mathbf{x}) = \dfrac{1}{2}\|\mathbf{A} \mathbf{x} - \mathbf{b}\|_2^2$ is half the square of the Euclidean norm of $\mathbf{A} \mathbf{x} - \mathbf{b}$. So we have that

$$\begin{align} f(\mathbf{x}) = \dfrac{1}{2}\|\mathbf{A} \mathbf{x} - \mathbf{b}\|_2^2 &= \dfrac{1}{2} \left( \sqrt{(\mathbf{A} \mathbf{x} - \mathbf{b})^2} \right)^2 \\ &= \dfrac{1}{2} (\mathbf{A} \mathbf{x} - \mathbf{b})^2 \\ &= \dfrac{1}{2} (\mathbf{A} \mathbf{x} - \mathbf{b})(\mathbf{A} \mathbf{x} - \mathbf{b}) \\ &= \dfrac{1}{2} [ (\mathbf{A}\mathbf{x})(\mathbf{A} \mathbf{x}) - (\mathbf{A} \mathbf{x})\mathbf{b} - (\mathbf{A} \mathbf{x})\mathbf{b} + \mathbf{b}^2 ] \ \ \text{(Since matrix multiplication is distributive.)} \\ &= \dfrac{1}{2} [(\mathbf{A} \mathbf{x})^2 - 2(\mathbf{A} \mathbf{x})\mathbf{b} + \mathbf{b}^2] \ \ \text{(Note: Matrix multiplication is not commutative.)} \end{align}$$

It was at this point that I realised that, since we're working with matrices and vectors, I'm not really sure how to take the gradient of this. Taking the gradient of $f(\mathbf{x})$ with respect to $\mathbf{x}$ as if everything were scalar, we get something like

$$\nabla_{\mathbf{x}} f(\mathbf{x}) = \dfrac{1}{2} [2 (\mathbf{A} \mathbf{x}) \mathbf{A}] - \dfrac{1}{2}[2(\mathbf{A} \mathbf{A} \mathbf{x})\mathbf{b}]$$

So what is the reasoning that leads us to get $\nabla_{\mathbf{x}} f(\mathbf{x}) = \mathbf{A}^T (\mathbf{A}\mathbf{x} - \mathbf{b}) = \mathbf{A}^T \mathbf{A} \mathbf{x} - \mathbf{A}^T \mathbf{b}$? Where did the transposed matrices come from?

I would greatly appreciate it if people would please take the time to clarify this.

Best Answer

We must take the derivative with finesse, and that means we use the chain rule. Note that $f = g \circ h$, where $h(x) = Ax - b$ and $g(u) = \frac{1}{2}\|u\|^2$. The derivatives of $h$ and $g$ are $h'(x) = A$ and $g'(u) = u^T$; the latter holds because $g(u) = \frac{1}{2} u^T u$, so $g(u + \Delta u) - g(u) = u^T \Delta u + \frac{1}{2}\|\Delta u\|^2$, and the part that is linear in $\Delta u$ is $u^T \Delta u$. So by the chain rule,

$$ f'(x) = g'(h(x))\, h'(x) = (Ax - b)^T A. $$

The gradient of $f$ is the transpose of this (row-vector) derivative:

$$ \nabla f(x) = f'(x)^T = A^T(Ax - b). $$
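To see directly where the transposes come from, one can also expand the squared norm and differentiate term by term, using the standard identities $\nabla_{x}(x^T M x) = (M + M^T)x$ and $\nabla_{x}(c^T x) = c$:

$$\begin{align} f(x) &= \tfrac{1}{2}(Ax - b)^T(Ax - b) = \tfrac{1}{2} x^T A^T A x - b^T A x + \tfrac{1}{2} b^T b \\ \nabla_{x} f(x) &= \tfrac{1}{2}(A^T A + A^T A) x - A^T b = A^T A x - A^T b = A^T(Ax - b). \end{align}$$

The transposes enter the moment the squared norm is written correctly as $v^T v$ rather than as a scalar square $v^2$: for a vector $v$, $\|v\|_2^2 = v^T v$, and $(Ax)^T = x^T A^T$.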