Chain rule for vector by vector derivative

I think it's clear that:
\begin{equation}
\frac{d(\mathbf{A} \mathbf{x})}{d \mathbf{x}}=\mathbf{A}, \quad \text{ where $\mathbf{A}$ is a matrix and $\mathbf{x}$ is a vector}.
\end{equation}

But if we had a vector-valued function $f: \mathbb{R}^n \rightarrow \mathbb{R}^n$, what can we say about the following derivative:
\begin{equation}
\frac{d\left(f(\mathbf{A} \mathbf{x})\right)}{d \mathbf{x}}= \text{?}
\end{equation}

For scalar valued univariate functions we know that:
\begin{equation}
\frac{d(g(ax))}{d(x)} = \frac{d(g(ax))}{d(ax)}\frac{d(ax)}{dx} = g'(ax) a
\end{equation}

In other words, what is the chain rule for vector by vector derivatives? Is it something like the following?
\begin{equation}
\frac{d\left(f(\mathbf{A} \mathbf{x})\right)}{d \mathbf{x}}= \left(\frac{d(\mathbf{A}\mathbf{x})}{d\mathbf{x}}\right)^T \frac{d\left(f(\mathbf{A} \mathbf{x})\right)}{d (\mathbf{A}\mathbf{x})} = \mathbf{A}^T f'(\mathbf{A}\mathbf{x})
\end{equation}

Best Answer

If you have differentiable functions $f: \mathbb{R}^{n} \longrightarrow \mathbb{R}^{m}$ and $g: \mathbb{R}^{k} \longrightarrow \mathbb{R}^{n}$, the chain rule behaves just as in the scalar case, as mentioned in the comments: the derivative of the function $f\circ g: \mathbb{R}^{k} \longrightarrow \mathbb{R}^{m}$ is given by $$(f \circ g)'(x) = f'\big(g(x)\big) \cdot g'(x).$$ The only difference is that $\cdot$ now denotes composition of the respective derivatives, which are linear transformations: $$(f \circ g)'(x): \mathbb{R}^{k} \longrightarrow \mathbb{R}^{m}$$ $$f'\big(g(x)\big): \mathbb{R}^{n} \longrightarrow \mathbb{R}^{m}$$ $$g'(x): \mathbb{R}^{k} \longrightarrow \mathbb{R}^{n}$$ You can also fix bases and think of these derivatives as (Jacobian) matrices, in which case $(f \circ g)'(x)$ is $m\times k$, $f'\big(g(x)\big)$ is $m\times n$, and $g'(x)$ is $n\times k$; as you can verify, the inner dimensions agree, so the matrix product is well-defined and has the right shape.
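To make the matrix form concrete, here is a sketch in coordinates, assuming the standard bases and writing $y = g(x)$ (the index names $i$, $j$, $l$ are only labels introduced for this illustration): $$\big[(f\circ g)'(x)\big]_{ij} = \frac{\partial (f\circ g)_i}{\partial x_j}(x) = \sum_{l=1}^{n} \frac{\partial f_i}{\partial y_l}\big(g(x)\big)\,\frac{\partial g_l}{\partial x_j}(x), \qquad 1\le i\le m,\ 1\le j\le k.$$ The right-hand side is exactly the $(i,j)$ entry of the product of the $m\times n$ matrix $f'\big(g(x)\big)$ with the $n\times k$ matrix $g'(x)$.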

Now, in your case, $n = m = k$ and $g(x) = Ax$ is a linear transformation. For a linear transformation, $g'(x) = g$ for every $x\in \mathbb{R}^{n}$; in matrix terms, $g'(x) = A$. This says that the best linear approximation of a linear transformation near $x$ is the linear transformation itself, which is quite intuitive. Back to the chain rule: for $g(x)=Ax$, we have $$(f \circ g)'(x) = f'(Ax) \cdot A.$$
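If you want to convince yourself numerically, here is a minimal sketch in plain Python that compares a central-difference Jacobian of $x \mapsto f(Ax)$ with both candidates, $f'(Ax)\,A$ and $A^T f'(Ax)$. The map `f`, the matrix `A`, the point `x`, and all helper names are hypothetical choices made only for this check; nothing here comes from the question itself.

```python
import math

def matvec(A, x):
    # Matrix-vector product A x.
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

def matmul(P, Q):
    # Matrix-matrix product P Q.
    return [[sum(P[i][l] * Q[l][j] for l in range(len(Q)))
             for j in range(len(Q[0]))] for i in range(len(P))]

def transpose(M):
    return [list(col) for col in zip(*M)]

def f(y):
    # Hypothetical example map R^2 -> R^2 (an assumption for illustration).
    return [math.sin(y[0]) * y[1], y[0] ** 2 + math.exp(y[1])]

def f_jacobian(y):
    # Analytic Jacobian of f: row i holds the partial derivatives of f_i.
    return [[math.cos(y[0]) * y[1], math.sin(y[0])],
            [2 * y[0], math.exp(y[1])]]

def numerical_jacobian(h, x, eps=1e-6):
    # Central differences: column j approximates dh/dx_j.
    m, n = len(h(x)), len(x)
    J = [[0.0] * n for _ in range(m)]
    for j in range(n):
        xp, xm = list(x), list(x)
        xp[j] += eps
        xm[j] -= eps
        hp, hm = h(xp), h(xm)
        for i in range(m):
            J[i][j] = (hp[i] - hm[i]) / (2 * eps)
    return J

def max_abs_diff(P, Q):
    return max(abs(p - q) for rp, rq in zip(P, Q) for p, q in zip(rp, rq))

A = [[1.0, 2.0], [-0.5, 0.3]]   # arbitrary non-symmetric matrix
x = [0.4, -0.7]                 # arbitrary test point

J_numeric = numerical_jacobian(lambda v: f(matvec(A, v)), x)
J_chain = matmul(f_jacobian(matvec(A, x)), A)                   # f'(Ax) A
J_transposed = matmul(transpose(A), f_jacobian(matvec(A, x)))   # A^T f'(Ax)

err_chain = max_abs_diff(J_numeric, J_chain)
err_transposed = max_abs_diff(J_numeric, J_transposed)
print("error of f'(Ax) A  :", err_chain)
print("error of A^T f'(Ax):", err_transposed)
```

For this particular choice, the first error should be at the level of finite-difference noise, while the second is of order one. A non-symmetric `A` is used on purpose: for special choices of $A$ and $f$ the two products can coincide and hide the difference.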

I am not sure how you would get a transpose, but some references use different conventions and perhaps your $\mathrm{d}/\mathrm{d}x$ notation means something else (some kind of gradient?). For instance, for $m=1$ and $f: \mathbb{R}^{n} \longrightarrow \mathbb{R}$, the gradient $\nabla f(x)$ is defined as the unique vector that satisfies $$f'(x)[v] = \langle v, \nabla f(x) \rangle.$$ Here, we denote by $f'(x)[v]$ the linear functional $f'(x)$ applied to the vector $v \in \mathbb{R}^n$, which gives a number. In matrix terms, $f'(x)$ is a $1 \times n$ row vector and, with the standard inner product, $\nabla f(x) = f'(x)^T$ is the corresponding column vector. In this case, $$(f \circ g)'(x) = f'(Ax) \cdot A$$ while, taking transposes, $$\nabla(f \circ g)(x) = \big(f'(Ax) \cdot A\big)^T = A^T \nabla f (Ax).$$
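The gradient version can be checked the same way. The following sketch, again in plain Python with a hypothetical scalar function `f` and an arbitrary `A` and `x` chosen only for illustration, compares a central-difference gradient of $x \mapsto f(Ax)$ with $A^T \nabla f(Ax)$, and also confirms that the row vector $f'(Ax)\,A$ has the same entries.

```python
import math

def matvec(A, x):
    # Matrix-vector product A x.
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

def transpose(M):
    return [list(col) for col in zip(*M)]

def f(y):
    # Hypothetical scalar function R^2 -> R (an assumption for illustration).
    return math.sin(y[0]) + y[0] * y[1] ** 2

def grad_f(y):
    # Analytic gradient of f, viewed as a column vector.
    return [math.cos(y[0]) + y[1] ** 2, 2 * y[0] * y[1]]

def numerical_gradient(h, x, eps=1e-6):
    # Central differences: entry j approximates dh/dx_j.
    g = []
    for j in range(len(x)):
        xp, xm = list(x), list(x)
        xp[j] += eps
        xm[j] -= eps
        g.append((h(xp) - h(xm)) / (2 * eps))
    return g

A = [[1.0, 2.0], [-0.5, 0.3]]   # arbitrary non-symmetric matrix
x = [0.4, -0.7]                 # arbitrary test point
Ax = matvec(A, x)

grad_numeric = numerical_gradient(lambda v: f(matvec(A, v)), x)

# Column-vector form: A^T grad f(Ax).
grad_formula = matvec(transpose(A), grad_f(Ax))

# Row-vector form: f'(Ax) A, computed as (row vector) times (matrix).
row_formula = [sum(grad_f(Ax)[l] * A[l][j] for l in range(len(A)))
               for j in range(len(A[0]))]

err_grad = max(abs(a - b) for a, b in zip(grad_numeric, grad_formula))
err_row_vs_col = max(abs(a - b) for a, b in zip(row_formula, grad_formula))
print("error of A^T grad f(Ax):", err_grad)
print("row vs column entries  :", err_row_vs_col)
```

The transpose appears only because the gradient is stored as a column vector; the entries of the derivative and of the gradient are the same numbers.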