Neural Networks – Rank of Gradient of Loss with Respect to Layer Weights in an MLP

Tags: gradient, perceptron, weights

The paper https://arxiv.org/abs/2110.11309 makes the following claim at the end of page 3:

The gradient of the loss $L$ with respect to the weights $W_l$ of an MLP is a sum of rank-1 matrices, one for each of the $B$ batch elements: $\nabla_{W_l}L = \sum_{i=1}^B \delta_{l+1}^i {u_l^i}^T$, where $\delta_{l+1}^i$ is the gradient of the loss for batch element $i$ with respect to the pre-activations at layer $l + 1$, and $u_l^i$ are the inputs to layer $l$ for batch element $i$.

Suppose that we have an MLP with $k$ hidden layers (every hidden layer is followed by an activation function). Then the weight matrices will be $W_1, W_2, \dots, W_k$ (plus the biases, but they are irrelevant for now), and their sizes will be $(D_1, D), (D_2, D_1), \dots, (D_k, D_{k-1})$ respectively, where $D$ is the number of input features.

Therefore, hidden layer $l$ has a weight matrix $W_l$ of size $(D_l, D_{l-1})$. The gradient of the loss with respect to it (for a single batch element), $\frac{\partial L}{\partial W_l}$, will also be a matrix of size $(D_l, D_{l-1})$.

So, if I understand correctly, the authors of the paper are claiming that $\frac{\partial L}{\partial W_l}$ is a rank-1 matrix? That is, every row (or column) is a scalar multiple of a single row (or column)? If yes, why? How?
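To make the setup concrete, here is a small PyTorch sketch (all layer sizes are arbitrary placeholders) that builds such an MLP and, for a single batch element, prints the shape and the numerically computed rank of one of these gradients:

```python
# Hypothetical example: an MLP with two hidden layers and a single batch element.
import torch
import torch.nn as nn

torch.manual_seed(0)

D, D1, D2, D3 = 10, 32, 24, 3              # input, hidden, hidden, output sizes (arbitrary)
model = nn.Sequential(
    nn.Linear(D, D1), nn.ReLU(),
    nn.Linear(D1, D2), nn.ReLU(),          # this layer's weight W_2 has shape (D_2, D_1)
    nn.Linear(D2, D3),
)

x = torch.randn(1, D)                      # batch size 1
loss = nn.CrossEntropyLoss()(model(x), torch.tensor([0]))
loss.backward()

grad_W2 = model[2].weight.grad             # dL/dW_2
print(grad_W2.shape)                       # torch.Size([24, 32]) == (D_2, D_1)
print(torch.linalg.matrix_rank(grad_W2))   # tensor(1): rank 1 for a single element
```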

Best Answer

There is an explanation of this identity in the newly-added Appendix D of the updated paper on OpenReview. I've reproduced it here with minor modifications.

Appendix D: Rank-1 Gradient for MLPs

In the simplified case of an MLP and a batch size of 1, we describe the rank-1 gradient of the loss $L$ with respect to the layer $\ell$ weight matrix $W_\ell$. We define the inputs to layer $\ell$ as $u_\ell$ and the *pre-activation* inputs to layer $\ell + 1$ as $z_{\ell+1}=W_\ell u_\ell$. We define $\delta_{\ell+1}$ as the gradient of $L$ with respect to $z_{\ell+1}$ (we assume that $\delta_{\ell+1}$ is pre-computed, as a result of standard backpropagation). We will show that the gradient of the loss $L$ with respect to $W_\ell$ is equal to $\delta_{\ell+1}u_\ell^\top$.

By the chain rule, the derivative of the loss with respect to weight $W_\ell^{ij}$ is equal to

\begin{equation} \frac{\partial L}{\partial W_\ell^{ij}} = \sum_k\frac{\partial L}{\partial z_{\ell+1}^k}\frac{\partial z_{\ell+1}^k}{\partial W_\ell^{ij}} = \frac{\partial L}{\partial z_{\ell+1}^i}\frac{\partial z_{\ell+1}^i}{\partial W_\ell^{ij}}, \end{equation}

the product of the derivative of $L$ with respect to the next-layer pre-activation $z_{\ell+1}^i$ and the derivative of $z_{\ell+1}^i$ with respect to $W_\ell^{ij}$. The second equality holds because $\frac{\partial z_{\ell+1}^k}{\partial W_\ell^{ij}} = 0$ for $k \neq i$.

Noting that $z_{\ell+1}^i = \sum_j u_\ell^j W_\ell^{ij}$, we can replace $\frac{\partial z_{\ell+1}^i}{\partial W_\ell^{ij}}$ with simply $u_\ell^j$ in the equation above. Further, we defined $\delta_{\ell+1}^i$ to be exactly $\frac{\partial L}{\partial z_{\ell+1}^i}$. Making these two substitutions, we have

\begin{equation} \frac{\partial L}{\partial W_\ell^{ij}} = \delta_{\ell+1}^i u_\ell^j \end{equation}

or, in vector notation, $\nabla_{W_\ell}L = \delta_{\ell+1}u_\ell^\top$, which is the identity we set out to prove. Since $\delta_{\ell+1}u_\ell^\top$ is an outer product of two vectors, every column is a scalar multiple of $\delta_{\ell+1}$ (and every row a scalar multiple of $u_\ell^\top$), so this per-example gradient has rank at most 1.
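As a quick sanity check (not part of the appendix above), here is a minimal PyTorch sketch, with arbitrary layer sizes and loss, that treats one linear layer's weight as $W_\ell$ and compares autograd's $\nabla_{W_\ell}L$ against the outer product $\delta_{\ell+1}u_\ell^\top$ for a single example:

```python
# Hypothetical verification of grad(W_l) = delta_{l+1} u_l^T for batch size 1.
import torch
import torch.nn as nn

torch.manual_seed(0)

prev_layer = nn.Linear(8, 16)              # earlier layer, only used to produce u_l
layer_l = nn.Linear(16, 4)                 # the layer whose weight plays the role of W_l

x = torch.randn(1, 8)                      # a single example
u_l = torch.relu(prev_layer(x))            # u_l: input to layer l, shape (1, 16)
z_next = layer_l(u_l)                      # z_{l+1} = W_l u_l + b_l, shape (1, 4)
z_next.retain_grad()                       # keep dL/dz_{l+1} after backward
loss = nn.CrossEntropyLoss()(z_next, torch.tensor([1]))
loss.backward()

delta = z_next.grad                        # delta_{l+1}, shape (1, 4)
outer = delta.T @ u_l.detach()             # delta_{l+1} u_l^T, shape (4, 16)

print(torch.allclose(layer_l.weight.grad, outer, atol=1e-6))   # True
print(torch.linalg.matrix_rank(layer_l.weight.grad))           # tensor(1)
```

With a batch of $B$ elements, autograd accumulates the gradient over the batch, so the stored gradient becomes a sum of $B$ such rank-1 terms, matching the expression quoted in the question.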

Disclaimer: I'm the first author of the linked paper.