[Math] Chain rule for multivariable gradients – a matrix of gradients

calculus, multivariable-calculus, vector-analysis

In my coursebook, there was a function to be differentiated.

Its definition was: $$\varphi(x,y) = f(u(x,y), v(x,y)) $$ where $$f(u(x,y), v(x,y)) \in \mathbb R$$
This function is clearly a composition:
$$(x,y) \mapsto_{g} (u(x,y), v(x,y)) \mapsto_f \varphi $$
Therefore, to calculate its derivative we need to apply the chain rule. On the one hand, the derivative is:
$$\nabla \varphi = \left(\frac {\partial \varphi}{\partial x}, \frac{\partial \varphi}{\partial y} \right)$$
On the other hand, the derivative of $\varphi$ is:
$$f' \cdot g' = \nabla f \cdot \nabla g$$
Now, here comes the part which I do not understand. I know that the derivative of a multivariable function is a vector, so I expect $\nabla g$ to be a vector looking something like this:
$$\nabla g = \left(\frac{\partial g}{\partial u}, \frac{\partial g}{\partial v}\right)$$
However, the textbook presents $\nabla g$ as a matrix looking exactly like this:
$$ \nabla g = \begin{bmatrix}\nabla u \\ \nabla v \end{bmatrix} $$
I do not understand why it is possible to present one gradient as a matrix of other gradients. Could you please clarify a bit what is going on here?

Best Answer

You’re getting tripped up by a misconception that Git Gud pointed out in a comment: that the derivative of a multivariable function is always a vector. Although this is true when the function’s codomain is one-dimensional, which might be all that you’ve been exposed to until now, it’s not generally true for vector-valued functions.

Assuming the “bulk” form of the chain rule that you’ve cited, we have, as you say, $\nabla\phi = \nabla f\nabla g$. Looking at this in purely algebraic terms, $\nabla\phi$ is a $1\times2$ matrix (a vector) as is $\nabla f$, so there are only two possibilities for $\nabla g$: it’s either a scalar or a $2\times2$ matrix. Multiplication by a scalar is the same as multiplication by a multiple of the identity matrix, so both of these cases can be combined into one.

What does this $2\times2$ matrix look like? Expanding the components of $\nabla\phi$ we have $$\begin{align}\nabla\phi = \begin{pmatrix}{\partial\phi\over\partial x} & {\partial\phi\over\partial y} \end{pmatrix} &= \begin{pmatrix} {\partial f\over\partial u}{\partial u\over\partial x}+{\partial f\over\partial v}{\partial v\over\partial x} & {\partial f\over\partial u}{\partial u\over\partial y}+{\partial f\over\partial v}{\partial v\over\partial y} \end{pmatrix} \\ &= \begin{pmatrix}{\partial f\over\partial u} & {\partial f\over\partial v} \end{pmatrix} \begin{pmatrix} {\partial u\over\partial x} & {\partial u\over\partial y} \\ {\partial v\over\partial x} & {\partial v\over\partial y} \end{pmatrix}.\end{align}\tag{*}$$ Comparing this to the product $\nabla\phi = \nabla f\nabla g$ we can see that $\nabla g$ must be the $2\times2$ matrix of partial derivatives in the last line above. This matrix of partial derivatives is known as the Jacobian matrix of $g$ and one can think of the gradient of a scalar-valued function as a special case of this.
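To see this in action, here is a small worked example (the functions are my own choice, purely for illustration, not from your coursebook): take $u(x,y)=xy$, $v(x,y)=x+y$ and $f(u,v)=uv$, so that $\phi(x,y)=xy(x+y)=x^2y+xy^2$. Then $$\nabla f = \begin{pmatrix}{\partial f\over\partial u} & {\partial f\over\partial v}\end{pmatrix} = \begin{pmatrix}v & u\end{pmatrix} = \begin{pmatrix}x+y & xy\end{pmatrix}, \qquad \nabla g = \begin{pmatrix}y & x\\ 1 & 1\end{pmatrix},$$ and the product is $$\nabla f\,\nabla g = \begin{pmatrix}(x+y)y+xy & (x+y)x+xy\end{pmatrix} = \begin{pmatrix}2xy+y^2 & x^2+2xy\end{pmatrix},$$ which agrees with $\nabla\phi$ computed directly from $\phi=x^2y+xy^2$.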

The first row of the above matrix is $\nabla u$ and the second row is $\nabla v$, so we can write $$\nabla g = \begin{pmatrix}\nabla u\\\nabla v\end{pmatrix}.$$ This point of view, in which the rows or columns of a matrix are treated as individual row/column vectors, is quite common, for instance when we talk about the row and column spaces of a matrix or describe the columns of a transformation matrix as the images of the basis vectors. The product of a row vector and a matrix, such as we have in (*), can be viewed as a linear combination of the rows of the matrix with coefficients given by the vector, i.e., $$\nabla\phi = \nabla f\nabla g = {\partial f\over\partial u}\nabla u+{\partial f\over\partial v}\nabla v.$$
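In the illustrative example above, this reads $$\nabla\phi = {\partial f\over\partial u}\nabla u+{\partial f\over\partial v}\nabla v = (x+y)\begin{pmatrix}y & x\end{pmatrix}+xy\begin{pmatrix}1 & 1\end{pmatrix} = \begin{pmatrix}2xy+y^2 & x^2+2xy\end{pmatrix},$$ the same answer as from the matrix product.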

Another way to see why $\nabla g$ is a matrix is to go back to the definition of the differential of a function as the linear map that best approximates the change in the function’s value near a point: $$f(\mathbf v+\mathbf h)-f(\mathbf v) = \operatorname{d}f_{\mathbf v}[\mathbf h]+o(\mathbf h).$$ From this definition, we see that if, say, $f:\mathbb R^n\to\mathbb R^m$, then $\operatorname{d}f_{\mathbf v}:\mathbb R^n\to\mathbb R^m$, too. (Technically, the domain of $\operatorname{d}f_{\mathbf v}$ is the tangent space at $\mathbf v$, but we can gloss over that when working in $\mathbb R^n$.) This means that $\operatorname{d}f_{\mathbf v}$, when expressed in coordinates, is an $m\times n$ matrix, the Jacobian matrix of $f$, in fact. Now, strictly speaking $\operatorname{d}f_{\mathbf v}$ and the derivative of $f$ aren’t quite the same thing—for $f:\mathbb R^n\to\mathbb R$, for instance, $\operatorname{d}f_{\mathbf v}$ is a linear functional that lives in the dual space $(\mathbb R^n)^*$ while $\nabla f(\mathbf v)$ is a vector in $\mathbb R^n$—but with a suitable choice of coordinate systems we can blithely ignore this distinction, identify them, and say that $\nabla g$ is also a $2\times2$ matrix.
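To connect this back to the illustrative functions from before: for $g(x,y)=(xy,\;x+y)$, the differential at a point $\mathbf v=(x,y)$ sends an increment $\mathbf h=(h_1,h_2)$ to $$\operatorname{d}g_{\mathbf v}[\mathbf h] = \begin{pmatrix}\nabla u\cdot\mathbf h\\ \nabla v\cdot\mathbf h\end{pmatrix} = \begin{pmatrix}yh_1+xh_2\\ h_1+h_2\end{pmatrix} = \begin{pmatrix}y & x\\ 1 & 1\end{pmatrix}\begin{pmatrix}h_1\\ h_2\end{pmatrix},$$ so in coordinates this linear map is exactly multiplication by the $2\times2$ Jacobian matrix above.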
