Chain Rule in Matrix Derivatives

chain rule, derivatives, matrices, matrix-calculus

I am trying to understand how matrix derivatives work, focusing on the chain rule.

Consider $g(U): \mathbb{R}^{N\times N} \rightarrow \mathbb{R}$ and $U=f(X) : \mathbb{R}^{N\times N} \rightarrow \mathbb{R}^{N\times N}$.

Then, applying the chain rule, I know that:

$\frac{\partial g(U)}{\partial X_{ij}}=\text{Tr}[(\frac{\partial g(U)}{\partial U})^T\frac{\partial U}{\partial X_{ij}}]$
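This identity can be checked numerically. Below is a minimal NumPy sketch under an assumed concrete choice (not from the question itself): $f(X)=X^2$, so $\partial U/\partial X_{ij}=E_{ij}X+XE_{ij}$, and $g(U)=\|U\|_F^2$, so $\partial g/\partial U=2U$. The trace formula is compared against a central finite difference in a single entry.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 3
X = rng.standard_normal((N, N))

f = lambda X: X @ X          # assumed example: U = f(X) = X^2
g = lambda U: np.sum(U**2)   # assumed example: g(U) = ||U||_F^2, so dg/dU = 2U

U = f(X)
G = 2 * U                    # gradient of g w.r.t. U

# dU/dX_ij for f(X) = X^2: d(X^2) = dX·X + X·dX  =>  E_ij X + X E_ij
i, j = 1, 2
E = np.zeros((N, N)); E[i, j] = 1.0
dU_dXij = E @ X + X @ E

lhs = np.trace(G.T @ dU_dXij)   # the chain-rule trace formula

# central finite-difference approximation of d g(f(X)) / dX_ij
eps = 1e-6
Xp = X.copy(); Xp[i, j] += eps
Xm = X.copy(); Xm[i, j] -= eps
rhs = (g(f(Xp)) - g(f(Xm))) / (2 * eps)

assert np.isclose(lhs, rhs, atol=1e-5)
```

The two sides agree to finite-difference accuracy, which is a quick sanity check on the scalar-valued chain rule before moving to the matrix-valued case.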

However, what happens if $g(U): \mathbb{R}^{N\times N} \rightarrow \mathbb{R}^{N\times N}$? That is, what if I have to take the derivative of a matrix w.r.t. a matrix? This could appear, for instance, if we have $U=f(Z) : \mathbb{R}^{N\times N} \rightarrow \mathbb{R}^{N\times N}$ and $Z=h(X) : \mathbb{R}^{N\times N} \rightarrow \mathbb{R}^{N\times N}$, as one of the steps of the chain rule will involve the derivative of $U$ w.r.t. $Z$.

Best Answer

Let's assume that $f$ can be expanded as a power series, i.e. $$\eqalign{ U &= f(X) = \sum_{k=0}^{\infty} \beta_kX^k \\ dU &= \sum_{k=0}^{\infty}\beta_k \sum_{j=0}^{k-1} X^{j}\,dX\,X^{k-j-1} \\ }$$ You've told us nothing about the $g(U)$ function, but let's also assume you know how to calculate its gradient $$G = \frac{\partial g}{\partial U} \quad\implies dg = G:dU$$ where the colon denotes the trace/Frobenius product, i.e. $\;A:B={\rm Tr}(A^TB)$.
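The differential formula for the power series can be verified numerically. Here is a sketch that assumes a truncated series (coefficients $\beta_k = 1, 1, \tfrac12, \tfrac16$, a hypothetical choice resembling a truncated exponential) and compares the sum $\sum_k \beta_k \sum_{j=0}^{k-1} X^j\,dX\,X^{k-j-1}$ against a finite-difference directional derivative in a random direction $dX$.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 3
X = rng.standard_normal((N, N)) * 0.3
dX = rng.standard_normal((N, N))

# hypothetical truncated power series: f(X) = sum_k beta_k X^k
beta = [1.0, 1.0, 0.5, 1.0 / 6.0]

def f(X):
    U, P = beta[0] * np.eye(N), np.eye(N)
    for k in range(1, len(beta)):
        P = P @ X
        U = U + beta[k] * P
    return U

# dU = sum_k beta_k sum_{j=0}^{k-1} X^j dX X^{k-j-1}
mp = np.linalg.matrix_power
dU = np.zeros((N, N))
for k in range(1, len(beta)):
    for j in range(k):
        dU += beta[k] * mp(X, j) @ dX @ mp(X, k - j - 1)

# central finite-difference directional derivative of f at X along dX
eps = 1e-6
dU_fd = (f(X + eps * dX) - f(X - eps * dX)) / (2 * eps)

assert np.allclose(dU, dU_fd, atol=1e-5)
```

Note that the $X^j\,dX\,X^{k-j-1}$ terms cannot be collapsed to $kX^{k-1}\,dX$ because $X$ and $dX$ do not commute in general; the sum over $j$ keeps track of where $dX$ sits in each product.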

Combining these results yields $$\eqalign{ dg &= G:\sum_{k=0}^{\infty}\beta_k \sum_{j=0}^{k-1} X^{j}\,dX\,X^{k-j-1} \\ &= \sum_{k=0}^{\infty}\beta_k \sum_{j=0}^{k-1} \Big[X^{k-j-1}\,G^T\,X^{j}\Big]^T \,:\,dX \\ \frac{\partial g}{\partial X} &= \sum_{k=0}^{\infty}\beta_k \sum_{j=0}^{k-1} \Big[X^{k-j-1}\,G^T\,X^{j}\Big]^T \\ }$$ Thus one can calculate the desired gradient without ever forming the fourth-order tensor $\frac{\partial U}{\partial X}$.
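The closed-form gradient can be checked end to end. The sketch below again assumes a hypothetical truncated series (coefficients $\beta_k = 1, 1, \tfrac12, \tfrac16$) and $g(U)=\|U\|_F^2$ (so $G = 2U$), neither of which is specified in the question, and compares $\sum_k \beta_k \sum_{j=0}^{k-1}\big[X^{k-j-1}G^TX^j\big]^T$ against an entrywise finite difference of $g(f(X))$.

```python
import numpy as np

rng = np.random.default_rng(2)
N = 3
X = rng.standard_normal((N, N)) * 0.3
beta = [1.0, 1.0, 0.5, 1.0 / 6.0]   # hypothetical truncated series coefficients

def f(X):
    U, P = beta[0] * np.eye(N), np.eye(N)
    for k in range(1, len(beta)):
        P = P @ X
        U = U + beta[k] * P
    return U

g = lambda U: np.sum(U**2)   # assumed g, so G = dg/dU = 2U
U = f(X)
G = 2 * U

# closed-form gradient: sum_k beta_k sum_{j=0}^{k-1} [X^{k-j-1} G^T X^j]^T
mp = np.linalg.matrix_power
grad = np.zeros((N, N))
for k in range(1, len(beta)):
    for j in range(k):
        grad += beta[k] * (mp(X, k - j - 1) @ G.T @ mp(X, j)).T

# entrywise central finite-difference check of d g(f(X)) / dX
eps = 1e-6
grad_fd = np.zeros((N, N))
for i in range(N):
    for j in range(N):
        Xp = X.copy(); Xp[i, j] += eps
        Xm = X.copy(); Xm[i, j] -= eps
        grad_fd[i, j] = (g(f(Xp)) - g(f(Xm))) / (2 * eps)

assert np.allclose(grad, grad_fd, atol=1e-4)
```

The whole $N \times N$ gradient is recovered with matrix products only, which is the practical payoff of avoiding the $N^2 \times N^2$ tensor $\frac{\partial U}{\partial X}$.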
