Computing Neural Network Gradients

matrix-calculus, multivariable-calculus, neural-networks, partial-derivative, vectors

The note "Computing Neural Network Gradients" explains how to take derivatives with respect to matrices and vectors. I have some questions:

The figure below, from the note, shows that when we take the derivative with respect to a column vector we have $\frac{\partial (W\mathbf{x})}{\partial \mathbf{x}} = W$. Suppose $W$ is an $n \times m$ matrix and $\mathbf{x}$ is an $m$-dimensional column vector.

[figure from the notes illustrating $\frac{\partial (W\mathbf{x})}{\partial \mathbf{x}} = W$]
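As a quick numerical check (my own sketch, not from the notes; shapes and values are made up), JAX's `jax.jacobian` uses numerator layout, and it indeed returns $W$ itself for the map $\mathbf{x} \mapsto W\mathbf{x}$:

```python
import jax
import jax.numpy as jnp

# Hypothetical sizes for illustration: W is n x m, x is m-dimensional.
n, m = 3, 4
key_w, key_x = jax.random.split(jax.random.PRNGKey(0))
W = jax.random.normal(key_w, (n, m))
x = jax.random.normal(key_x, (m,))

f = lambda x: W @ x                # f: R^m -> R^n

# jax.jacobian uses numerator layout: J[i, j] = d f_i / d x_j.
J = jax.jacobian(f)(x)             # shape (n, m)
print(jnp.allclose(J, W))          # True: the Jacobian is W, not W^T
```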

Meanwhile, in the table below from Wikipedia, we have $\frac{\partial (A\mathbf{x})}{\partial \mathbf{x}} = A^T$ (since we take the derivative with respect to a column vector, I assume denominator layout is the corresponding convention).

[table of vector-by-vector identities from the Wikipedia article on matrix calculus]
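Continuing the sketch above (again my own illustration, reusing `W`, `x`, and `f` from the previous snippet), the Wikipedia entry is the same object written in denominator layout, i.e., the transpose:

```python
# Denominator layout transposes the numerator-layout Jacobian:
# J_denom[j, i] = d f_i / d x_j, so for f(x) = A @ x it equals A^T.
J_denom = jax.jacobian(f)(x).T     # shape (m, n)
print(jnp.allclose(J_denom, W.T))  # True: W^T under denominator layout
```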

Q1. I don't understand why the note shows $W$ instead of $W^T$. Can anyone help explain this?

Next, in the last part of the note, the author adds:

We see that the dimensions of all the terms in the gradient match up (i.e., the number of columns in a term equals the number of rows in the next term). This will always be the case if we computed our gradients correctly. Now we can use the error terms to compute our gradients. Note that we transpose our answers when computing the gradients for column vector terms to follow the shape convention.

You can see the picture below for further details.

[figure from the notes showing the gradient computation via error terms]
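To make the quoted passage concrete, here is a small sketch of my own (a hypothetical one-layer setup, not taken from the notes): for $\mathbf{z} = W\mathbf{x}$ and scalar loss $J = \mathbf{u}^T\mathbf{z}$, the error term is $\boldsymbol\delta = \partial J/\partial \mathbf{z} = \mathbf{u}^T$, and the shape-convention gradient $\boldsymbol\delta^T \mathbf{x}^T$ has the shape of $W$:

```python
import jax
import jax.numpy as jnp

# Hypothetical one-layer setup: z = W @ x, scalar loss J = u . z.
n, m = 3, 4
k1, k2, k3 = jax.random.split(jax.random.PRNGKey(1), 3)
W = jax.random.normal(k1, (n, m))
x = jax.random.normal(k2, (m,))
u = jax.random.normal(k3, (n,))

loss = lambda W: u @ (W @ x)

# jax.grad follows the shape convention: the gradient has W's shape.
g = jax.grad(loss)(W)                    # shape (n, m), same as W
# The error term dJ/dz is u; the chain rule gives dJ/dW = u x^T.
print(jnp.allclose(g, jnp.outer(u, x)))  # True
```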

Q2. If we always take the derivative with respect to a column vector, why do we need to transpose the result? Rather than matching dimensions as an after-the-fact check, how can I tell from the beginning when I need to transpose the result while applying the chain rule?

I would appreciate any help in understanding this. Thank you.

Best Answer

Note that $$\left(\frac{\partial \mathbf{z}}{\partial\mathbf{x}}\right)_{ij} = \frac{\partial z_i}{\partial x_j}$$ is numerator layout, not denominator layout. This is because the "column"-ness of $\mathbf{z}$ is preserved (e.g., when $\mathbf{z}$ is a column vector, $\frac{\partial\mathbf{z}}{\partial x_j}$ is a column vector). Also, the $j$ indexing over the $x_j$'s corresponds to columns in the matrix.

So the Wikipedia page agrees that it should be $W$, not $W^T$: the note's definition is numerator layout, and in numerator layout $\frac{\partial (W\mathbf{x})}{\partial \mathbf{x}} = W$. The Wikipedia table you quoted is stated in denominator layout, hence the transpose.

As for your second question, things certainly get weird in those notes. From what I can tell, they inexplicably swap to denominator layout for matrices. The reason they do this is essentially to fix "weird" things about using numerator layout, such as the derivative of a constant with respect to a matrix: $\frac{\partial a}{\partial \mathbf{W}}$ is a zero matrix with the dimensions of $\mathbf{W}^T$, not $\mathbf{W}$. The transposing is a fudge factor to compensate for swapping between these conventions. The notes emphasize that if you track the dimensions, things should match up; e.g., if you expect a column vector but compute a row vector, transpose.
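To see the two conventions side by side in code (a minimal sketch of my own, assuming any scalar-valued $f(\mathbf{W})$): deep-learning autodiff follows the shape convention, and the strict numerator-layout derivative is its transpose:

```python
import jax
import jax.numpy as jnp

# Hypothetical scalar function of a matrix, chosen only for shape-keeping.
n, m = 3, 4
W = jnp.arange(n * m, dtype=jnp.float32).reshape(n, m)
f = lambda W: jnp.sum(W ** 2)      # any scalar-valued f(W) would do

g = jax.grad(f)(W)                 # shape convention: shape (n, m) == W.shape
# Strict numerator layout puts df/dW_{ji} at entry (i, j), i.e. it is
# g.T, with the shape of W^T. The notes' "transpose at the end" is
# exactly the conversion between these two layouts.
print(g.shape, g.T.shape)          # (3, 4) (4, 3)
```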

My opinion

The notes are more confusing than they are worth (unless you are taking the class). I personally prefer the third layout convention mentioned on the Wikipedia page, i.e., numerator layout, but written as $\frac{\partial\mathbf{z}}{\partial \mathbf{x}^T}$. This is especially useful for matrices, as the format of $\frac{\partial f}{\partial \mathbf{W}^T}$ is obvious: $$\frac{\partial f}{\partial \mathbf{W}^T} = \begin{bmatrix} \frac{\partial f}{\partial W_{11}} & \frac{\partial f}{\partial W_{21}} & \dotsb & \frac{\partial f}{\partial W_{m1}}\\ \frac{\partial f}{\partial W_{12}} & \frac{\partial f}{\partial W_{22}} & \dotsb & \frac{\partial f}{\partial W_{m2}}\\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial f}{\partial W_{1n}} & \frac{\partial f}{\partial W_{2n}} & \dotsb & \frac{\partial f}{\partial W_{mn}}\\ \end{bmatrix}$$
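As a concrete instance of this convention (my own example, with made-up shapes): for $f(\mathbf{W}) = \mathbf{a}^T W \mathbf{b}$ we have $\frac{\partial f}{\partial W_{ij}} = a_i b_j$, so the matrix above works out to $\mathbf{b}\mathbf{a}^T$:

```python
import jax
import jax.numpy as jnp

m, n = 3, 4                        # W is m x n, matching the display above
ka, kw, kb = jax.random.split(jax.random.PRNGKey(2), 3)
a = jax.random.normal(ka, (m,))
W = jax.random.normal(kw, (m, n))
b = jax.random.normal(kb, (n,))

f = lambda W: a @ W @ b            # scalar: f = a^T W b, df/dW_ij = a_i b_j

# df/dW^T as displayed: entry (i, j) is df/dW_{ji} = a_j b_i, i.e. b a^T.
dW_T = jax.grad(f)(W).T            # jax.grad matches W's shape; transpose it
print(jnp.allclose(dW_T, jnp.outer(b, a)))  # True
```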

Whatever you do, be consistent. For some reason those notes advocate quite the opposite.
