Chain rule for the composition of scalar and multivariable functions

calculus, chain rule, derivatives, matrix-calculus, multivariable-calculus

Suppose I have the function $f(\mathbf x)= \mathbf a^\top \log (\mathbf A \mathbf x )$, where $\mathbf x \in \mathbb{R}^n$, $\mathbf A \in \mathbb{R}^{m \times n}$, and $\mathbf a \in \mathbb{R}^m$. The $\log$ function is applied entry-wise.
I am treating $f(\cdot)$ as the composition of three functions: the linear function $g(\mathbf y)= \mathbf a^\top \mathbf y$, the function $y(\mathbf z)= \log (\mathbf z)$, and the function $z(\mathbf x)= \mathbf A \mathbf x$.

So I have $f(\mathbf x)= g(y(z(\mathbf x)))$. This is a composition of scalar-valued and vector-valued functions, so the chain rule for scalar functions does not apply directly (the order of the factors also matters once vectors and matrices are involved). How do I properly compute the gradient and the Hessian of such a function? How do I handle the entry-wise $\log$ in the chain rule?

Thank you.

Best Answer

$\require{enclose}\def\p#1#2{\frac{\partial #1}{\partial #2}}\def\o#1{\operatorname{#1}}\def\D{\o{Diag}}$ Take an ordinary scalar function $\phi(s)$ and its derivative $\phi'(s)=\frac{d\phi}{ds}$, and apply them element-wise to a vector argument $v$ to generate the associated vector functions
$$h=\phi(v),\qquad h'=\phi'(v)$$
The differential of such a vector function can be expressed using an element-wise $(\odot)$ product or, better yet, a diagonal matrix
$$\eqalign{ dh &= h'\odot dv \;=\; \D(h')\,dv\\ }$$
Setting $\;h=\log(v),\;v=Ax,\;V=\D(v)\;$ (with every entry of $v$ positive, so that the logarithm is defined) this becomes
$$\eqalign{ dh = d\log(v) &= V^{-1}dv\;=\; V^{-1}A\,dx \\ }$$

Now we're ready to calculate the differential and gradient of the function in the question.
$$\eqalign{ f &= a:h \\ df &= a:dh \\ &= a:V^{-1}A\,dx \\ &= A^TV^{-1}a:dx \\ \p{f}{x} &= A^TV^{-1}a \;=\; g\qquad({\rm gradient}) \\ }$$
Here $g$ denotes the gradient vector, not the linear function $g$ from the question.

The Hessian is the gradient of the gradient, so start the calculation with the differential of $g$.
$$\eqalign{ dg &= A^T\color{red}{dV^{-1}}a \\ &= A^T\color{red}{(-V^{-1}dV\,V^{-1})}a \\ &= -A^TV^{-2}dV\,a \\ &= -A^TV^{-2}\D(a)\;dv \\ &= -A^TV^{-2}\D(a)\,A\;dx \\ \p{g}{x} &= -A^TV^{-2}\D(a)\,A \;=\; H\qquad({\rm Hessian}) \\ }$$
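As a sanity check, the closed forms for the gradient and Hessian derived above can be compared against central finite differences. The following is only a minimal NumPy sketch: the function names, the random test data, and the step size `eps` are assumptions made for illustration, and $A$ and $x$ are drawn with positive entries so that every entry of $Ax$ is positive.

```python
import numpy as np

def f(x, A, a):
    """f(x) = a^T log(A x), with the log applied entry-wise."""
    return a @ np.log(A @ x)

def gradient(x, A, a):
    """Closed form A^T V^{-1} a, where V = Diag(A x)."""
    v = A @ x
    return A.T @ (a / v)

def hessian(x, A, a):
    """Closed form -A^T V^{-2} Diag(a) A, where V = Diag(A x)."""
    v = A @ x
    # Scaling row i of A by a_i / v_i^2 is the same as V^{-2} Diag(a) A.
    return -A.T @ ((a / v**2)[:, None] * A)

def finite_difference(fun, x, eps=1e-6):
    """Central differences; the last axis of the result indexes x."""
    columns = []
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = eps
        columns.append((fun(x + e) - fun(x - e)) / (2 * eps))
    return np.stack(columns, axis=-1)

# Hypothetical test data: positive A and x keep A x > 0 entry-wise.
rng = np.random.default_rng(0)
m, n = 5, 3
A = rng.uniform(0.5, 1.5, size=(m, n))
x = rng.uniform(0.5, 1.5, size=n)
a = rng.normal(size=m)

grad_fd = finite_difference(lambda t: f(t, A, a), x)
hess_fd = finite_difference(lambda t: gradient(t, A, a), x)

grad_err = np.max(np.abs(gradient(x, A, a) - grad_fd))
hess_err = np.max(np.abs(hessian(x, A, a) - hess_fd))
print("max gradient error:", grad_err)
print("max Hessian error:", hess_err)
```

Both errors should be small compared with the entries of the gradient and Hessian themselves. Note that the closed forms never build the $m\times m$ matrix $V$ explicitly; dividing by the vector $v$ is enough.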


A colon has been used to denote the trace/Frobenius product, i.e. $$\eqalign{ A:B &= {\rm Tr}(A^TB) \;=\; \sum_{i=1}^m\sum_{j=1}^n A_{ij}B_{ij} \\ }$$ This definition can also be applied to vectors by treating them as rectangular matrices (set $n=1$) in which case it is equal to the standard dot product.

The terms in a Frobenius product can be rearranged in a number of ways, e.g.
$$\eqalign{ A:B &= B:A \;=\; B^T:A^T \\ CA:B &= A:C^TB \;=\; C:BA^T \\ }$$ due to the properties of the underlying trace function.
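The Frobenius product and its rearrangement rules are easy to check numerically. This is a small NumPy sketch under assumed shapes ($A$ and $B$ are $m\times n$, and $C$ is taken square, $m\times m$, so that every product is defined); the helper name `frob` is hypothetical.

```python
import numpy as np

def frob(X, Y):
    """Frobenius product X:Y, i.e. the sum of element-wise products."""
    return np.sum(X * Y)

rng = np.random.default_rng(1)
m, n = 4, 3
A = rng.normal(size=(m, n))
B = rng.normal(size=(m, n))
C = rng.normal(size=(m, m))  # square, so that C A has the shape of B

# Definition: A:B = Tr(A^T B).
trace_form = np.trace(A.T @ B)
matches_trace = np.isclose(frob(A, B), trace_form)

# Symmetry and transposition: A:B = B:A = B^T:A^T.
symmetric = np.isclose(frob(A, B), frob(B, A))
transposed = np.isclose(frob(A, B), frob(B.T, A.T))

# Moving a factor across the colon: CA:B = A:C^T B = C:B A^T.
move_left = np.isclose(frob(C @ A, B), frob(A, C.T @ B))
move_right = np.isclose(frob(C @ A, B), frob(C, B @ A.T))

# For vectors the product reduces to the ordinary dot product.
u, w = rng.normal(size=m), rng.normal(size=m)
is_dot = np.isclose(frob(u, w), u @ w)

print(matches_trace, symmetric, transposed, move_left, move_right, is_dot)
```

The "move a factor across the colon" rule is the one used in the gradient derivation to turn $a:V^{-1}A\,dx$ into $A^TV^{-1}a:dx$ (with $V^{-1}$ symmetric because it is diagonal).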

Some of the steps in the derivation used certain properties of diagonal matrices. In terms of the vectors {$a,b$} these properties can be written as $$\eqalign{ \D(a)\cdot\D(b) &= \D(b)\cdot\D(a) \\ \D(a)\cdot b &= \D(b)\cdot a \\ }$$
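The two diagonal-matrix properties can be confirmed on random vectors. In this minimal sketch the vectors and the seed are arbitrary assumptions; `np.diag` builds a diagonal matrix from a 1-D array.

```python
import numpy as np

rng = np.random.default_rng(2)
a = rng.normal(size=4)
b = rng.normal(size=4)
Da, Db = np.diag(a), np.diag(b)

# Diagonal matrices commute with each other.
commute = np.allclose(Da @ Db, Db @ Da)

# Diag(a) b = Diag(b) a, and both equal the element-wise product of a and b.
swap = np.allclose(Da @ b, Db @ a)
hadamard = np.allclose(Da @ b, a * b)

print(commute, swap, hadamard)
```

The second property is what justifies replacing $dV\,a$ with $\operatorname{Diag}(a)\,dv$ in the Hessian derivation, and the first is what allows $V^{-1}dV\,V^{-1}$ to be collapsed to $V^{-2}dV$.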

Finally, one of the comments asked about the product rule for differentials. Here is the general rule for arbitrary tensors {$S,T$} and any product {$\star$} with which they are dimensionally compatible. $$\eqalign{ d(S\star T) &= (S+dS)\star(T+dT) \;-\; S\star T \\ &= \big(S\star T +S\star dT +dS\star T +dS\star dT\big) \;-\; S\star T \\ &= S\star dT + dS\star T + (\enclose{horizontalstrike}{dS\star dT}) \\ &= S\star dT + dS\star T \\ }$$
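The reason the $dS\star dT$ term can be dropped is that it is second order in the perturbation: halving the size of $dS$ and $dT$ cuts it by a factor of four, while the retained terms only halve. The sketch below illustrates this for the ordinary matrix product; the matrices, the perturbation directions `E` and `F`, and the step sizes are all assumptions chosen for illustration.

```python
import numpy as np

def product_rule_residual(S, T, E, F, t):
    """Norm of the gap between the exact change in S @ T and the
    first-order product rule, for perturbations dS = t E and dT = t F."""
    dS, dT = t * E, t * F
    exact = (S + dS) @ (T + dT) - S @ T
    linear = S @ dT + dS @ T
    return np.linalg.norm(exact - linear)

rng = np.random.default_rng(3)
S, E = rng.normal(size=(3, 4)), rng.normal(size=(3, 4))
T, F = rng.normal(size=(4, 2)), rng.normal(size=(4, 2))

r_full = product_rule_residual(S, T, E, F, 1e-2)
r_half = product_rule_residual(S, T, E, F, 5e-3)
ratio = r_half / r_full  # close to 1/4 if the gap is second order

print("residual at t=1e-2:", r_full)
print("residual at t=5e-3:", r_half)
print("ratio:", ratio)
```

The same experiment works for any bilinear product (Hadamard, Kronecker, Frobenius) by swapping out the `@` operator.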
