Gradient of squared norm

vector analysis

I am looking for the gradient of the function

$$
f(x)= \dfrac{1}{2} \Vert A^Tx \Vert^2 – b^Tx \, .
$$

Well so far I came up with

$$
\nabla f(x) = AA^Tx-b
$$

because

$$
\begin{aligned}
f(x) &= \dfrac{1}{2} \Vert A^Tx \Vert^2 – b^Tx \\
&= \dfrac{1}{2} (A^Tx)^T(A^Tx) – b^Tx \\
&= \dfrac{1}{2} x^TAA^Tx – b^Tx \\
\end{aligned}
$$

and therefore

$$
\begin{aligned}
\nabla f(x) &= \dfrac{1}{2} (AA^T + (AA^T)^T)x – b \\
&= \dfrac{1}{2} (AA^T + AA^T)x – b \\
&= \dfrac{1}{2} \cdot 2 \cdot AA^T x – b \\
&= AA^T x – b \\
\end{aligned}
$$

Is my reasoning correct? Is there a shorter way to calculate the gradient?

Best Answer

We can compute in the following way as well, using the identity $\|x-y\|^2 = \|x\|^2 - 2\langle x,y\rangle +\|y\|^2$:

$$\begin{align}f(x+h) - f(x) &= \frac{1}{2}\|A^T(x + h)\|^2 - \frac{1}{2}\|A^Tx\|^2 -b^T(x+h) + b^Tx\\ &= \langle A^Tx,A^T h\rangle - b^T h\\ &= \langle AA^Tx - b,h\rangle\end{align}$$ so the gradient $\nabla f(x)$ is $AA^Tx - b$.

Related Solutions

[Math] Chain rule for multivariable gradients – a matrix of gradients

You’re getting tripped up by a misconception that Git Gud pointed out in a comment: that the derivative of a multivariable function is always a vector. Although this is true when the function’s codomain is one-dimensional, which might be all that you’ve been exposed to until now, it’s not generally true for vector-valued functions.

Assuming the “bulk” form of the chain rule that you’ve cited, we have, as you say, $\nabla\phi = \nabla f\nabla g$. Looking at this in purely algebraic terms, $\nabla\phi$ is a $1\times2$ matrix (a vector) as is $\nabla f$, so there are only two possibilities for $\nabla g$: it’s either a scalar or a $2\times2$ matrix. Multiplication by a scalar is the same as multiplication by a multiple of the identity matrix, so both of these cases can be combined into one.

What does this $2\times2$ matrix look like? Expanding the components of $\nabla\phi$ we have $$\begin{align}\nabla\phi = \begin{pmatrix}{\partial\phi\over\partial x} & {\partial\phi\over\partial y} \end{pmatrix} &= \begin{pmatrix} {\partial f\over\partial u}{\partial u\over\partial x}+{\partial f\over\partial v}{\partial v\over\partial x} & {\partial f\over\partial u}{\partial u\over\partial y}+{\partial f\over\partial v}{\partial v\over\partial y} \end{pmatrix} \\ &= \begin{pmatrix}{\partial f\over\partial u} & {\partial f\over\partial v} \end{pmatrix} \begin{pmatrix} {\partial u\over\partial x} & {\partial u\over\partial y} \\ {\partial v\over\partial x} & {\partial v\over\partial y} \end{pmatrix}.\end{align}\tag{*}$$ Comparing this to the product $\nabla\phi = \nabla f\nabla g$ we can see that $\nabla g$ must be the $2\times2$ matrix of partial derivatives in the last line above. This matrix of partial derivatives is known as the Jacobian matrix of $g$ and one can think of the gradient of a scalar-valued function as a special case of this.

The first row of the above matrix is $\nabla u$ and the second row $\nabla v$, so we can write $$\nabla g = \begin{pmatrix}\nabla u\\\nabla v\end{pmatrix}.$$ This point of view in which the rows or columns of a matrix are treated as individual row/column vectors is quite common, for instance when we talk about the row and column spaces of a matrix or describe the columns of a transformation matrix as the images of the basis vectors. The product of a row vector and matrix, such as we have in (*) can be viewed as a linear combination of the rows of the matrix with coefficients given by the vector, i.e., $$\nabla\phi = \nabla f\nabla g = {\partial f\over\partial u}\nabla u+{\partial f\over\partial v}\nabla v.$$

Another way to see why $\nabla g$ is a matrix is to go back to the definition of the differential of a function as the linear map that best approximates the change in the function’s value near a point: $$f(\mathbf v+\mathbf h)-f(\mathbf v) = \operatorname{d}f_{\mathbf v}[\mathbf h]+o(\mathbf h).$$ From this definition, we see that if, say, $f:\mathbb R^n\to\mathbb R^m$, then $\operatorname{d}f_{\mathbf v}:\mathbb R^n\to\mathbb R^m$, too. (Technically, the domain of $\operatorname{d}f_{\mathbf v}$ is the tangent space at $\mathbf v$, but we can gloss over that when working in $\mathbb R^n$.) This means that $\operatorname{d}f_{\mathbf v}$, when expressed in coordinates, is an $m\times n$ matrix, the Jacobian matrix of $f$, in fact. Now, strictly speaking $\operatorname{d}f_{\mathbf v}$ and the derivative of $f$ aren’t quite the same thing—for $f:\mathbb R^n\to\mathbb R$, for instance, $\operatorname{d}f_{\mathbf v}$ is a linear functional that lives in the dual space $\mathbb (R^n)^*$ while $\nabla f(\mathbf v)$ is a vector in $\mathbb R^n$—but with a suitable choice of coordinate systems we can blithely ignore this distinction, identify them and say that $\nabla g$ is also $2\times2$ matrix.

The general expression for the gradient of $\dfrac{1}{2} x^TA(x)x$

I assume you mean $A: \mathbb R^n \to \mathbb R^{n\times n}$ so that $A$ takes in a vector and gives a matrix. I'll operate under this assumption.

We see $$f(x) = \frac 1 2\sum_{i,j = 1}^n x_ix_jA_{ij}(x).$$ Now fix $k \in \{ 1,\ldots, n\}$, Then by the product rule $$\frac{\partial f }{\partial x_k}(x) = \frac{1}{2} \sum^n_{i,j=1} \left( x_ix_j\frac{\partial A_{ij}}{\partial x_k}(x) + \delta_{ik} x_j A_{ij}(x) + \delta_{jk} x_i A_{ij}(x)\right) $$ where $\delta_{ab} = \left\{\begin{smallmatrix} 1, & a = b, \\ 0, & a \neq b.\end{smallmatrix} \right.$ Resolving these $\delta$'s, we see $$\frac{\partial f }{\partial x_k}(x) = \frac{1}{2} \left(\sum^n_{i,j=1} x_ix_j\frac{\partial A_{ij}}{\partial x_k}(x) \right) + \frac 1 2 \left( \sum_{j=1}^n x_j A_{kj}(x)\right) + \frac 1 2\left( \sum^n_{i=1} x_i A_{ik}(x)\right).$$ Now the latter two terms turn into $\frac 1 2(A(x) + A^T(x))x$ when you put everything together. For the first term, you sort of need to invent notation (actually we'll use tensor notation), and what you have is $$\frac 1 2 x^T(\nabla \circ A)(x)x$$ where $\nabla \circ A: \mathbb R^n \to \mathbb R^{n\times n \times n}$ is given by $(\nabla \circ A)_{ijk}(x) = \frac{\partial A_{ij}}{\partial x_k}(x).$ Note that $x^T (\nabla\circ A)(x) x \in \mathbb R^n$ for any $x \in \mathbb R^n$ and by convention the inner products operate on the first two dimensions of $\nabla \circ A$. Then you can write $$\nabla f(x) = \frac 1 2 (x^T(\nabla\circ A)(x) x + A(x)x + A^T(x) x).$$

Best Answer

Related Solutions

[Math] Chain rule for multivariable gradients – a matrix of gradients

The general expression for the gradient of $\dfrac{1}{2} x^TA(x)x$

Related Question