Chain rule for matrices in index notation

Tags: chain rule, index-notation, matrices, matrix-calculus

Consider recursive relations
$$
\textbf{H}^t=\sigma(\textbf{Z}^t)
$$

$$
\textbf{Z}^t=\textbf{A}\textbf{H}^{t-1}\textbf{W}^t
$$

where $\textbf{Z}^t\in\mathbb{R}^{m\times n_t}$, $\textbf{A}\in\mathbb{R}^{m\times m}$, $\textbf{H}^{t-1}\in\mathbb{R}^{m\times o_t}$, $\textbf{W}^t\in\mathbb{R}^{o_t\times n_t}$, $\sigma$ is an element-wise function, and $1\leq t\leq T$. Also, let
$$
L=f(\textbf{Z}^T)
$$

where $L\in \mathbb{R}$ and $f:\mathbb R ^ {m\times n_T}\rightarrow \mathbb R$ is a function.
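For concreteness, the forward recursion can be sketched numerically. This is a minimal sketch: the sizes, $\sigma=\tanh$, and the quadratic $f$ are all hypothetical choices, and a single shared width $n$ is used since $\textbf{H}^t=\sigma(\textbf{Z}^t)$ forces $o_{t+1}=n_t$.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, T = 3, 4, 5               # hypothetical sizes; one shared width n
sigma = np.tanh                 # stand-in for the element-wise sigma

A = rng.standard_normal((m, m))
H = rng.standard_normal((m, n))          # H^0
for t in range(T):
    W = rng.standard_normal((n, n))      # W^t
    Z = A @ H @ W                        # Z^t = A H^{t-1} W^t
    H = sigma(Z)                         # H^t = sigma(Z^t)

L = 0.5 * np.sum(Z**2)                   # an example f : R^{m x n_T} -> R
```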

Problem:

Given $\textbf{e}^T:=\partial L/\partial \textbf{Z}^T$, derive a recursive relation to compute $\textbf{e}^t:=\partial L/\partial \textbf{Z}^t$ for $1\leq t < T$.


My attempt:

We can write the recursive relation
$$
\frac{\partial L}{\partial \textbf{Z}^{t-1}}=\frac{\partial L}{\partial \textbf{Z}^t} \frac{\partial \textbf{Z}^t}{\partial \textbf{H}^{t-1}} \frac{\partial \textbf{H}^{t-1}}{\partial \textbf{Z}^{t-1}}
$$

We can write this in index notation as
\begin{equation}
\frac{\partial L}{\partial Z^{t-1}_{pq}}=\sum_{i,j}\sum_{k,l}\frac{\partial L}{\partial Z^t_{ij}} \frac{\partial Z^t_{ij}}{\partial H^{t-1}_{kl}} \frac{\partial H^{t-1}_{kl}}{\partial Z^{t-1}_{pq}}
\label{eq:1}
\end{equation}

Writing the equation for $\textbf{Z}^t$ in index notation,
$$
Z_{i j}^t=\sum_{k, l} A_{i k} H_{k l}^{t-1} W_{l j}^t
$$

we can derive
$$
\frac{\partial Z^t_{ij}}{\partial H_{kl}^{t-1}}=A_{ik}W_{lj}^t
$$
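This identity is easy to sanity-check numerically: since $\textbf{Z}^t$ is linear in $\textbf{H}^{t-1}$, a perturbation $d\textbf{H}$ changes $\textbf{Z}^t$ by exactly $\textbf{A}\,d\textbf{H}\,\textbf{W}^t$, which should match the double contraction $\sum_{k,l}A_{ik}W_{lj}^t\,dH_{kl}$. The sizes below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
m, o, n = 3, 4, 5
A  = rng.standard_normal((m, m))
H  = rng.standard_normal((m, o))
W  = rng.standard_normal((o, n))
dH = rng.standard_normal((m, o))

# Z = A H W is linear in H, so the change under H -> H + dH is exactly A dH W
dZ = A @ (H + dH) @ W - A @ H @ W

# index form: dZ_ij = sum_{k,l} (dZ_ij/dH_kl) dH_kl  with  dZ_ij/dH_kl = A_ik W_lj
dZ_index = np.einsum('ik,lj,kl->ij', A, W, dH)
assert np.allclose(dZ, dZ_index)
```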

Plugging this back into the recursive relation, we can write
$$
e^{t-1}_{pq}=\sum_{i,j}\sum_{k,l}e^{t}_{ij} A_{ik}W_{lj}^t \frac{\partial H^{t-1}_{kl}}{\partial Z^{t-1}_{pq}}
$$


Questions:

  1. Is my index notation representation of the chain rule ($\partial L/\partial Z^{t-1}_{pq}$) correct?
  2. I’m having a hard time deriving $\partial H^{t-1}_{kl}/\partial Z^{t-1}_{pq}$ in index notation and finding the final relation for $e_{pq}^{t-1}$.
  3. I’m also having trouble turning this into matrix representation for $\textbf{e}^{t-1}$.

Best Answer

$ \def\l{\lambda}\def\s{\sigma}\def\sp{\s^\prime} \def\o{{\tt1}}\def\p{\partial} \def\E{{\cal E}} \def\LR#1{\left(#1\right)} \def\op#1{\operatorname{#1}} \def\trace#1{\op{Tr}\LR{#1}} \def\qiq{\quad\implies\quad} \def\grad#1#2{\frac{\p #1}{\p #2}} \def\c#1{\color{red}{#1}} \def\CLR#1{\c{\LR{#1}}} $Your calculations are correct; however, as you have discovered, it is often difficult to apply the chain rule in matrix calculus because it requires awkward quantities which are not vectors or matrices but higher-order tensors.

Furthermore, these tensors are not particularly interesting. They are purely calculational devices.

Differentials are a better approach for matrix calculus. The differential of a matrix behaves like a matrix. In particular, it obeys all of the rules of matrix algebra.

While index notation is a very powerful tool, if you're just dealing with scalars, vectors, and matrices it is usually easier to use standard matrix notation. In addition, omitting component indices allows us to use subscripts as the layer indicator. And omitting all of those $\large\Sigma$ summations is easier on the eyes.

So let's reserve superscripts for standard matrix operations like transposes and inverses. To further minimize confusion with transposes, change the layer indicator from $(t,T)\to(k,K)$.

For typing convenience, define the following variables $$\eqalign{ G_k &= \grad{L}{Z_k} &\qiq &dL = G_k:dZ_k \\ S_k &= \sp(Z_k) \\ Z_k &= AH_{k-1}W_k &\qiq &dZ_k = {A\:dH_{k-1}\,W_k} \\ H_k &= \s(Z_k) &\qiq &dH_k = S_k\odot dZ_k \\ G_K &\doteq \c{E_K = {\rm given}} &&\big(\odot = {\rm Hadamard\:product}\big) \\ }$$ Start with the differential of $L$ at the output layer, then back-propagate to prior layers $$\eqalign{ dL &= \c{E_K}:dZ_K \\ &= G_K:dZ_K \\ &= G_K:\LR{A\:dH_{K-1}\,W_K} \\ &= \LR{A^TG_KW_K^T}:dH_{K-1} \\ &= \LR{A^TG_KW_K^T}:\LR{S_{K-1}\odot dZ_{K-1}} \\ &= \LR{\LR{A^TG_KW_K^T}\odot S_{K-1}}:dZ_{K-1} \\ &= G_{K-1}:dZ_{K-1} \\ &= \vdots\quad\vdots\quad\ \quad\vdots\quad\vdots \qquad\{{\rm repeat\:steps}\} \\ &= \LR{\LR{A^TG_{K-1}W_{K-1}^T}\odot S_{K-2}}:dZ_{K-2} \\ &= G_{K-2}:dZ_{K-2} \\ &= etc \\ }$$ Therefore the gradient of $L$ at the $k^{th}$ layer is $$\eqalign{ G_k &= \grad{L}{Z_k} = \LR{A^TG_{k+1}W_{k+1}^T}\odot S_{k} \\\\ }$$
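The recursion $G_k=\LR{A^TG_{k+1}W_{k+1}^T}\odot S_k$ can be checked against finite differences. This is a minimal sketch: the sizes, $\sigma=\tanh$, and $f(Z_K)=\tfrac12\|Z_K\|_F^2$ (which gives $G_K=Z_K$) are all hypothetical choices, and arrays are 0-indexed.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, K = 3, 4, 3                        # hypothetical sizes; one shared width n
sigma = np.tanh
dsigma = lambda z: 1.0 - np.tanh(z)**2   # sigma'

A  = rng.standard_normal((m, m))
Ws = [rng.standard_normal((n, n)) for _ in range(K)]   # W_1 .. W_K (0-indexed)

def tail_loss(Zk, k):
    """L as a function of Z_k alone, propagated through layers k+1 .. K."""
    Z = Zk
    for j in range(k + 1, K):
        Z = A @ sigma(Z) @ Ws[j]         # Z_{j+1} = A H_j W_{j+1}
    return 0.5 * np.sum(Z**2)            # f(Z_K) = ||Z_K||_F^2 / 2  =>  G_K = Z_K

# forward pass, storing each Z_k
H  = rng.standard_normal((m, n))         # H_0
Zs = []
for k in range(K):
    Zs.append(A @ H @ Ws[k])
    H = sigma(Zs[-1])

# backward:  G_K = Z_K,  then  G_k = (A^T G_{k+1} W_{k+1}^T) o sigma'(Z_k)
G = Zs[-1]
eps = 1e-6
for k in range(K - 2, -1, -1):
    G = (A.T @ G @ Ws[k + 1].T) * dsigma(Zs[k])
    # central finite differences of the tail loss w.r.t. each entry of Z_k
    num = np.zeros_like(G)
    for idx in np.ndindex(*G.shape):
        Zp = Zs[k].copy(); Zp[idx] += eps
        Zm = Zs[k].copy(); Zm[idx] -= eps
        num[idx] = (tail_loss(Zp, k) - tail_loss(Zm, k)) / (2 * eps)
    assert np.allclose(G, num, atol=1e-5)
```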


In the above, a colon is used to denote the Frobenius product, which is a concise notation for the trace (or a double sum in index notation) $$\eqalign{ A:B &= \trace{A^TB} \;=\; \sum_{i=1}^m\sum_{j=1}^n A_{ij}B_{ij} \\ A:A &= \|A\|^2_F \\ }$$ The properties of the underlying trace function allow the terms in a Frobenius product to be rearranged in many different ways, e.g. $$\eqalign{ A:B &= B:A \\ A:B &= A^T:B^T \\ C:\LR{AB} &= \LR{CB^T}:A &= \LR{A^TC}:B \\ }$$ Furthermore, the Frobenius and Hadamard products commute $$\eqalign{ \LR{A\odot B}:C &= A:\LR{B\odot C} \;=\; \sum_{i=1}^m\sum_{j=1}^n A_{ij}B_{ij}C_{ij} \\ }$$
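All of these product identities are one-liners to verify numerically; the shapes below are chosen arbitrarily (with compatible dimensions for the ordinary products).

```python
import numpy as np

rng = np.random.default_rng(1)
frob = lambda X, Y: np.trace(X.T @ Y)              # X:Y = Tr(X^T Y)

X, Y, Z = (rng.standard_normal((3, 4)) for _ in range(3))
assert np.isclose(frob(X, Y), np.sum(X * Y))       # double-sum form
assert np.isclose(frob(X, Y), frob(Y, X))          # X:Y = Y:X
assert np.isclose(frob(X, Y), frob(X.T, Y.T))      # X:Y = X^T:Y^T
assert np.isclose(frob(X * Y, Z), frob(X, Y * Z))  # Frobenius/Hadamard commute

# rearrangement with ordinary products (shapes: A 3x4, B 4x5, C 3x5)
A = rng.standard_normal((3, 4))
B = rng.standard_normal((4, 5))
C = rng.standard_normal((3, 5))
assert np.isclose(frob(C, A @ B), frob(C @ B.T, A))   # C:(AB) = (CB^T):A
assert np.isclose(frob(C, A @ B), frob(A.T @ C, B))   # C:(AB) = (A^T C):B
```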
