Matrix-calculus – Understanding numerator/denominator layouts

derivative, gradient descent, machine learning, matrix-calculus, neural networks

Also see this question for more external references!

Consider the following machine-learning model:

[Figure: network architecture]

Here, $J = \frac{1}{m} \sum_{i = 1}^{m} L(\hat{y}^{(i)}, y^{(i)})$, and $m$ is the number of training-examples.

While performing reverse-mode differentiation (or back-propagation), I have the following questions:

Throughout, I am using the numerator layout.

  1. What would be the dimension of the derivative $\frac{\mathrm{d} J}{\mathrm{d} \mathbf{L}}$?
    • Should it be a column-vector of dimension $(m, 1)$, because $\mathbf{L}$ is a row-vector of dimension $(1, m)$? (Source: here)
      • However, using this convention causes issues when computing the derivative $\frac{\mathrm{d} J}{\mathrm{d} \mathbf{a}} = \frac{\mathrm{d} J}{\mathrm{d} \mathbf{L}} \frac{\mathrm{d} \mathbf{L}}{\mathrm{d} \mathbf{a}}$, because $\frac{\mathrm{d} \mathbf{L}}{\mathrm{d} \mathbf{a}}$ would be an $(m, m)$ matrix while $\frac{\mathrm{d} J}{\mathrm{d} \mathbf{L}}$ is an $(m, 1)$ vector, so the shapes do not compose.
      • But this convention does serve well when computing derivatives of the form $\frac{\mathrm{d}y}{\mathrm{d}\mathbf{X}}$, where $y = f(\mathbf{X})$, $\mathbf{X}$ is a matrix of dimension $(m, n)$, and $f(\mathbf{X})$ is a scalar-valued function.
    • Or should it be a row-vector, because according to the numerator layout the derivative has dimensions $\text{numerator-dimension} \times (\text{denominator-dimension})^\intercal = (1,1)\times(m, 1)$? (Source: here) A shape check in NumPy follows this list.
      • Also, (for this point) is my understanding even correct?
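To make the shape bookkeeping concrete, here is a minimal NumPy sketch of the numerator-layout version (the squared-error per-example loss and the identification $\mathbf{a} = \hat{\mathbf{y}}$ are placeholder assumptions for illustration, not taken from the model above):

```python
import numpy as np

m = 4
y    = np.random.rand(m)          # targets
yhat = np.random.rand(m)          # predictions; here we take a = yhat as a placeholder

# Per-example loss L_i = (yhat_i - y_i)^2 (placeholder loss), J = mean(L)
L = (yhat - y) ** 2
J = L.mean()

# Numerator layout: dJ/dL is a (1, m) row vector, since J is a scalar and L has m entries.
dJ_dL = np.full((1, m), 1.0 / m)

# dL/da is the (m, m) Jacobian of L w.r.t. a = yhat; it is diagonal here
# because L_i depends only on a_i.
dL_da = np.diag(2.0 * (yhat - y))

# Chain rule in numerator layout: (1, m) @ (m, m) -> (1, m); the shapes compose directly.
dJ_da = dJ_dL @ dL_da
print(dJ_da.shape)                # (1, 4)

# Sanity check against the direct formula dJ/da_i = 2 (yhat_i - y_i) / m.
assert np.allclose(dJ_da.ravel(), 2.0 * (yhat - y) / m)
```

With the row-vector convention the chain rule is a plain matrix product, and the shapes compose as $(1, m)(m, m) = (1, m)$.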

PS: Also, is there any definitive guide from which I can learn matrix calculus from first principles? Although the following sources are good, they still leave a lot of gaps:

Best Answer

If you think of $L$ as a column vector, then I think both your sources agree that $\frac{dJ}{dL}$ should be a row vector.

But what if you really want $L$ as a row vector? Surely, the math shouldn't "care" about how you arrange your collection of numbers. One way to clarify this is to designate the dimensions of your objects as "covariant" or "contravariant".

Many things are contravariant, meaning they change opposite to a change in basis (if you go from a bigger unit, "hours", to a smaller unit, "seconds", your measurements become bigger). On the other hand, a derivative like "m/hour" becomes smaller when you change the units to "m/second", hence "co".

Things which are "co" can be multiplied with things which are "contra", e.g. 5 m/second * 10 seconds = 50m. Yet it makes much less sense to multiply two "contra" or two "co" together (admittedly, second^2 or m^2/second^2 are sometimes useful units, but this is not always the case).

So yes, you could say that $\frac{dJ}{dL}$ is a "column" covector with size $m$, and $\frac{dL}{da}$ is a matrix with shape (contra-$m$, co-$m$). We could write $\left(\frac{dJ}{dL}\right)^i = \frac{\partial J}{\partial L_i}$, and $\left(\frac{dL}{da}\right)_i^j = \frac{\partial L_i}{ \partial a_j}$ (we give superscripts to "co" dimensions, and subscripts to "contra", to make things clear). Then, following our rule that co can only be multiplied by contra, we see that

$$\left(\frac{dJ}{da}\right)^j = \sum_{i=1}^m \left(\frac{dJ}{dL}\right)^i \left(\frac{dL}{da}\right)_i^j = \left(\frac{dJ}{dL}^T \frac{dL}{da} \right)^j$$

So even if you "force" $\frac{dJ}{dL}$ into a column, if you want to respect our new multiplication rule, you need to transpose it before applying matrix multiplication.
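As a quick sanity check (a sketch with random placeholder arrays, not derived from the model above), the index contraction over $i$ and the "transpose-then-multiply" form give the same result:

```python
import numpy as np

m = 5
dJ_dL = np.random.rand(m, 1)      # dJ/dL "forced" into a column; entries (dJ/dL)^i
dL_da = np.random.rand(m, m)      # (dL/da)_i^j = dL_i / da_j

# Index contraction from the formula: sum over i of (dJ/dL)^i (dL/da)_i^j
dJ_da_indices = np.einsum('i,ij->j', dJ_dL[:, 0], dL_da)

# Same contraction as a matrix product: the column must be transposed first.
dJ_da_matmul = (dJ_dL.T @ dL_da).ravel()

assert np.allclose(dJ_da_indices, dJ_da_matmul)
```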

To take this a step further, let's say we are interested in $\frac{da}{dX}$, which has shape (contra-$m$, co-$(n,m)$): $\left( \frac{da}{dX} \right)_j^{u,v} = \frac{\partial a_j}{\partial X_{u,v}}$. Then we have

$$\left(\frac{dJ}{dX}\right)^{u,v} = \sum_{j=1}^m \left(\frac{dJ}{da}\right)^j \left(\frac{da}{dX}\right)_j^{u,v}$$
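Written with `np.einsum` (random placeholder arrays again, with $X$ stored with shape $(n, m)$ to match the $(u, v)$ indexing above), the contraction over the single contra/co pair $j$ looks like this:

```python
import numpy as np

m, n = 3, 4
dJ_da = np.random.rand(m)           # (dJ/da)^j, a covector over the m entries of a
da_dX = np.random.rand(m, n, m)     # (da/dX)_j^{u,v}: axis 0 is the contra index j,
                                    # axes 1-2 index the entries X_{u,v}

# Contract j (the only contra/co pair) to get (dJ/dX)^{u,v}.
dJ_dX = np.einsum('j,juv->uv', dJ_da, da_dX)
print(dJ_dX.shape)                  # (4, 3): the same shape as X
```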


To translate this back to "numerator layout" matrix-calculus terms, you could say that column vectors are always contravariant, row vectors are always covariant (or "covectors"), and gradients are covariant, hence always row vectors. An $m$ by $n$ Jacobian matrix is contra-$m$, co-$n$. This works nicely because if you think of a column vector as a (contra-$n$, co-1) matrix and a row vector as a (contra-1, co-$m$) matrix, then by following the ordinary rules of matrix multiplication you'll never accidentally multiply two contra or two co dimensions together, and the product of two objects will always be in (contra, co) form. On the other hand, "denominator layout" has everything in (co, contra) form, which is just as fine and accomplishes the same thing.
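For illustration, a minimal sketch of that bookkeeping with arbitrary placeholder shapes: every factor is stored as a (contra, co) block, so ordinary matrix multiplication only ever pairs a "co" axis with the next factor's "contra" axis.

```python
import numpy as np

n1, n2, n3 = 3, 4, 5
grad_J_wrt_z = np.random.rand(1, n1)    # scalar J w.r.t. z: a row vector, (contra-1, co-n1)
jac_z_wrt_y  = np.random.rand(n1, n2)   # Jacobian, (contra-n1, co-n2)
jac_y_wrt_x  = np.random.rand(n2, n3)   # Jacobian, (contra-n2, co-n3)

# Chaining in numerator layout: each product pairs a co axis with a contra axis.
grad_J_wrt_x = grad_J_wrt_z @ jac_z_wrt_y @ jac_y_wrt_x
print(grad_J_wrt_x.shape)               # (1, 5): again a row vector / covector
```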

However, if you start working with less standard objects, like the derivative of a matrix with respect to a vector, or the derivative of a row vector with respect to a column vector (as in our example above), then you'll need to keep track for yourself what is covariant and what is contravariant.
