How the derivative of a matrix leads to its transpose

calculus, derivatives, machine learning, neural networks, partial derivative

I'm deriving the backpropagation step of training a neural network using vectorized equations. The following are the two forward-propagation equations between the last hidden layer and the output layer.

$$ Z^{[2]} = W^{[2]}A^{[1]} + b^{[2]} $$
$$ A^{[2]} = \hat{y} = g(Z^{[2]})$$

where $g(z)$ is the activation function.
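For concreteness, here is a minimal NumPy sketch of the forward pass I have in mind; the layer sizes, the batch of $m$ examples, and the sigmoid activation are just illustrative assumptions.

```python
import numpy as np

# Assumed layer sizes for illustration: n1 hidden units, n2 output units, m examples
n1, n2, m = 4, 3, 5

A1 = np.random.randn(n1, m)   # A^[1]: activations of the last hidden layer, shape (n1, m)
W2 = np.random.randn(n2, n1)  # W^[2]: weights of the output layer,          shape (n2, n1)
b2 = np.random.randn(n2, 1)   # b^[2]: biases of the output layer,           shape (n2, 1)

def g(z):
    """Assumed activation: sigmoid."""
    return 1.0 / (1.0 + np.exp(-z))

Z2 = W2 @ A1 + b2             # Z^[2] = W^[2] A^[1] + b^[2],  shape (n2, m)
A2 = g(Z2)                    # A^[2] = y_hat = g(Z^[2]),     shape (n2, m)
```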

Now, in backpropagation, we calculate the derivative of the cost function ($J$) w.r.t. all the parameters.

I've successfully calculated $\partial{J}/\partial{A^{[2]}}$ and $\partial{J}/\partial{Z^{[2]}}$ using the chain rule. Now, to calculate $\partial{J}/\partial{W^{[2]}}$, I've formed the following chain:

$$ \frac{\partial{J}}{\partial{W^{[2]}}} = \frac{\partial{J}}{\partial{A^{[2]}}} \frac{\partial{A^{[2]}}}{\partial{Z^{[2]}}} \frac{\partial{Z^{[2]}}}{\partial{W^{[2]}}}$$

Now, to calculate $\frac{\partial{Z^{[2]}}}{\partial{W^{[2]}}}$, I used the equation $ Z^{[2]} = W^{[2]}A^{[1]} + b^{[2]} $, which simply gives the derivative $ A^{[1]} $. But in the literature it's given as $ A^{[1]^{T}} $, which is the transpose of my answer.

By checking the dimensions of the answer, I could verify that it should be $ A^{[1]^{T}} $ instead of $ A^{[1]} $. But is there any general rule for such cases by which I could tell directly whether the derivative will be the transpose of a matrix or not, without verifying dimensions?
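That dimension check can be sketched in a few lines of NumPy (same illustrative layer sizes as above; a random placeholder stands in for $\partial{J}/\partial{Z^{[2]}}$, which is assumed to have already been computed via the chain rule).

```python
import numpy as np

# Shape check for dJ/dW^[2], using the same illustrative layer sizes as above.
n1, n2, m = 4, 3, 5
A1  = np.random.randn(n1, m)   # A^[1],      shape (n1, m)
W2  = np.random.randn(n2, n1)  # W^[2],      shape (n2, n1)
dZ2 = np.random.randn(n2, m)   # dJ/dZ^[2],  same shape as Z^[2]: (n2, m)
                               # (random placeholder for the chain-rule result)

# Formula from the literature: dJ/dW^[2] = dZ^[2] @ A^[1].T
# (an averaging factor 1/m is often included when J is a mean over the m examples).
dW2 = dZ2 @ A1.T
print(dW2.shape == W2.shape)   # True: (n2, n1), the shape of W^[2]

# Without the transpose the product is not even conformable unless m == n1:
# dZ2 @ A1 would be (n2, m) @ (n1, m) and raises a ValueError.
```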

I've also checked the Matrix Cookbook but couldn't find any related rules of thumb.

Best Answer

Unfortunately, there are two common conventions in matrix calculus: the Jacobian (numerator layout) and the gradient (denominator layout) conventions.

For a function $f:\mathbb R^n \to \mathbb R^m$:

  1. Jacobian convention: $\frac{\partial f}{\partial x}$ is represented by the $m\times n$ matrix with entries $\big(\tfrac{\partial f}{\partial x}\big)_{ij} = \tfrac{\partial f_i}{\partial x_j}$

  2. Gradient convention: $\frac{\partial f}{\partial x}$ is represented by the $n\times m$ matrix with entries $\big(\tfrac{\partial f}{\partial x}\big)_{ij} = \tfrac{\partial f_j}{\partial x_i}$

If $f(x) = Ax$, then one can easily check that

  1. under the Jacobian convention, $\frac{\partial f}{\partial x} = A$

  2. under the gradient convention, $\frac{\partial f}{\partial x} = A^T$

So it is a matter of which convention you are following.
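As a concrete check, a minimal NumPy sketch (with a randomly chosen $A$ and $x$, purely for illustration) builds the numerator-layout Jacobian of $f(x) = Ax$ by finite differences and recovers $A$; the denominator-layout (gradient) derivative is then its transpose, $A^T$.

```python
import numpy as np

# Finite-difference check of the two layout conventions for f(x) = A x.
m, n = 3, 4
A = np.random.randn(m, n)
x = np.random.randn(n)
f = lambda v: A @ v

# Numerator (Jacobian) layout: J[i, j] = d f_i / d x_j, an (m, n) matrix.
eps = 1e-6
J = np.zeros((m, n))
for j in range(n):
    e_j = np.zeros(n)
    e_j[j] = eps
    J[:, j] = (f(x + e_j) - f(x)) / eps

print(np.allclose(J, A, atol=1e-4))  # True: numerator layout gives A ...
# ... so the denominator (gradient) layout, J.T, is A^T.
```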
