[Math] Numerator layout for derivatives and the chain rule

matrices, matrix-calculus

We have three matrices $\mathbf{W_2}$, $\mathbf{W_1}$ and $\mathbf{h}$ (technically a column vector):

$$
\mathbf{W_1} =
\begin{bmatrix}
a & b \\
c & d \\
\end{bmatrix}
\;\;\;\;\;\;\;\;\;
\mathbf{W_2} =
\begin{bmatrix}
e & f \\
\end{bmatrix}
\;\;\;\;\;\;\;\;\;
\mathbf{h} =
\begin{bmatrix}
h_1 \\
h_2 \\
\end{bmatrix}
$$

And a scalar $y$, where:

$$
y = \mathbf{W_2} \mathbf{W_1} \mathbf{h}
$$

I'd like to compute the derivative of $y$ with respect to $\mathbf{W_1}$, assuming numerator layout.

Using the chain rule:

$$
y = \mathbf{W_2} \mathbf{u}
\;\;\;\;\;\;\;\;\;
\mathbf{u} = \mathbf{W_1} \mathbf{h} \\
$$

$$
\begin{align}
\frac{\partial y}{\partial \mathbf{W_1}} &=
\frac{\partial y}{\partial \mathbf{u}} \frac{\partial \mathbf{u}}{\partial \mathbf{W_1}} \\
&= \mathbf{W_2} \frac{\partial \mathbf{u}}{\partial \mathbf{W_1}} \\
&= \mathbf{W_2} \mathbf{h}^{\top} \\
\end{align}
$$

All well and good, except that this isn't a $2 \times 2$ matrix! In fact, the dimensions don't even match up for matrix multiplication ($\mathbf{W_2}$ and $\mathbf{h}^\top$ are both $1 \times 2$), so something must be incorrect.

If we take the Wikipedia definition of the derivative of a scalar by a matrix, using numerator layout, we know that actually:

$$
\frac{\partial y}{\partial \mathbf{W_1}} =
\begin{bmatrix}
\frac{\partial y}{\partial a} & \frac{\partial y}{\partial c} \\
\frac{\partial y}{\partial b} & \frac{\partial y}{\partial d} \\
\end{bmatrix}
$$

Each element is just a scalar derivative, which we can calculate without any matrix calculus. If we do that by hand and then factorise, we end up with:

$$
\frac{\partial y}{\partial \mathbf{W_1}} = \mathbf{h} \mathbf{W_2}
$$
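
For reference, the hand calculation can be written out explicitly. Expanding the product with the entries defined at the top gives the scalar and its four partial derivatives:

$$
\begin{align}
y &= e\,(a h_1 + b h_2) + f\,(c h_1 + d h_2) \\
\frac{\partial y}{\partial a} &= e h_1, \qquad
\frac{\partial y}{\partial b} = e h_2, \qquad
\frac{\partial y}{\partial c} = f h_1, \qquad
\frac{\partial y}{\partial d} = f h_2
\end{align}
$$

Placing these in the numerator-layout arrangement shown above and factorising:

$$
\begin{bmatrix}
e h_1 & f h_1 \\
e h_2 & f h_2 \\
\end{bmatrix}
=
\begin{bmatrix}
h_1 \\
h_2 \\
\end{bmatrix}
\begin{bmatrix}
e & f \\
\end{bmatrix}
= \mathbf{h} \mathbf{W_2}
$$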

Clearly, $\mathbf{h} \mathbf{W_2} \neq \mathbf{W_2} \mathbf{h}^\top $.

Can anybody suggest where I went wrong?

Best Answer

$\def\p#1#2{\frac{\partial #1}{\partial #2}}$ Define the trace/Frobenius product as $$A:B \;=\; {\rm Tr}(A^TB) \;=\; \sum_{i=1}^m\sum_{j=1}^n A_{ij}B_{ij}$$ Using this product eliminates a whole category of transposition errors, which arise in other approaches.

The properties of the trace allow such products to be arranged in many equivalent ways, e.g. $$\eqalign{ A:C &= C:A &= C^T:A^T \\ AB:C &= A:CB^T &= B:A^TC \\ }$$ Note that the matrices on the LHS and RHS of the colon must have the same dimensions.
The Frobenius product is similar to the Hadamard product in this respect.
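
The rearrangement rules above are easy to spot-check numerically. The following is only a minimal sketch in plain Python: the helper names (`transpose`, `matmul`, `frob`) and the small integer matrices are illustrative choices, not part of the original answer.

```python
def transpose(M):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(row) for row in zip(*M)]


def matmul(A, B):
    """Ordinary matrix product of two list-of-rows matrices."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]


def frob(A, B):
    """Frobenius product A:B = sum_ij A_ij * B_ij (shapes must match)."""
    assert len(A) == len(B) and all(len(r) == len(s) for r, s in zip(A, B))
    return sum(a * b for r, s in zip(A, B) for a, b in zip(r, s))


# Arbitrary test matrices (assumed values, chosen only for illustration).
A = [[1, 2, 3], [4, 5, 6]]      # 2x3
B = [[1, 0], [2, -1], [0, 3]]   # 3x2
C = [[2, 1], [-1, 4]]           # 2x2

lhs = frob(matmul(A, B), C)                # AB : C
mid = frob(A, matmul(C, transpose(B)))     # A : C B^T
rhs = frob(B, matmul(transpose(A), C))     # B : A^T C

# Trace form of the same product: Tr((AB)^T C)
M = matmul(transpose(matmul(A, B)), C)
trace_form = sum(M[i][i] for i in range(len(M)))

print(lhs, mid, rhs, trace_form)
```

All four printed numbers agree, which is what the identities claim for this particular choice of matrices.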

Let's define some variables without the distracting subscripts $$W = W_1, \qquad w=W_2^T$$ Write the scalar function in terms of these new variables. Then calculate its differential and gradient. $$\eqalign{ y &= w:Wh \\&= wh^T:W \\ dy &= wh^T:dW \\ \p{y}{W} &= wh^T \\ }$$ The dimensions of this result equal the dimensions of the $W$ matrix, expressed as the outer product of two column vectors.

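Writing this in terms of the original variables $$\eqalign{ \p{y}{W_1} &= W_2^Th^T \\ }$$ This gradient has the same shape as $W_1$ (denominator layout). Its transpose, $hW_2$, is the numerator-layout result obtained by hand in the question.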
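
As a sanity check, the result can be compared against a finite-difference gradient. This is a rough sketch in plain Python; the function names (`y`, `numeric_grad`) and the numeric values assigned to the entries are assumptions made purely for illustration.

```python
def y(W1, W2, h):
    """Scalar y = W2 W1 h for a length-2 row W2, a 2x2 W1 and a length-2 h."""
    u = [sum(W1[i][k] * h[k] for k in range(2)) for i in range(2)]
    return sum(W2[i] * u[i] for i in range(2))


def numeric_grad(W1, W2, h, eps=1e-6):
    """Central differences; entry [i][j] approximates dy/dW1[i][j]."""
    grad = [[0.0, 0.0], [0.0, 0.0]]
    for i in range(2):
        for j in range(2):
            plus = [row[:] for row in W1]
            minus = [row[:] for row in W1]
            plus[i][j] += eps
            minus[i][j] -= eps
            grad[i][j] = (y(plus, W2, h) - y(minus, W2, h)) / (2 * eps)
    return grad


# Illustrative values for a, b, c, d, e, f, h1, h2 (not from the question).
W1 = [[1.0, 2.0], [3.0, 4.0]]
W2 = [5.0, 6.0]
h = [7.0, 8.0]

grad = numeric_grad(W1, W2, h)

# W2^T h^T: same shape as W1, entry [i][j] = W2[i] * h[j]
outer = [[W2[i] * h[j] for j in range(2)] for i in range(2)]

# h W2: the transpose of the above, entry [i][j] = h[i] * W2[j]
numerator_layout = [[h[i] * W2[j] for j in range(2)] for i in range(2)]

print(grad)
print(outer)
print(numerator_layout)
```

Up to floating-point error, `grad` matches `outer`, while `numerator_layout` holds the same four numbers transposed.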