D(vector)/d(matrix) in neural-network-deep-learning context

derivatives, gradient-descent, matrix-calculus, multivariable-calculus, neural-networks

When learning about neural networks, you are bound to stumble upon the technique of backpropagation/gradient descent. To give a brief explanation of the scenario we are looking at:

$$L = g(Wx+b) = g(a).$$

Here $L$ is a scalar, $W$ is a matrix of appropriate size, $x$ is a vector given as input, and $b$ is a vector of appropriate size; $g$ is a function that takes a vector of appropriate size and returns a scalar.

$W$ and $b$ are to be updated in the opposite direction of the gradient, so that $L$ is minimized.
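
To make the shapes concrete, here is a minimal NumPy sketch of this setup (the sizes, the choice of $g$ as a sum of squares, and the finite-difference gradient are my own illustration, not taken from any particular textbook):

```python
import numpy as np

# Minimal sketch of the setup above; the shapes and g are illustrative choices.
rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4))   # weight matrix
b = rng.normal(size=3)        # bias vector
x = rng.normal(size=4)        # input vector

def loss(W, b):
    a = W @ x + b
    return np.sum(a ** 2)     # L = g(a), with g(a) = sum(a_i^2) as an example

# Entry-by-entry (finite-difference) estimate of dL/dW, which makes it clear
# that the gradient must have the same shape as W for the update to make sense.
eps = 1e-6
dL_dW = np.zeros_like(W)
for j in range(W.shape[0]):
    for k in range(W.shape[1]):
        W_pert = W.copy()
        W_pert[j, k] += eps
        dL_dW[j, k] = (loss(W_pert, b) - loss(W, b)) / eps

# Gradient-descent step: move W opposite to the gradient.
eta = 0.1
W = W - eta * dL_dW
```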

Many textbooks, lectures, and blogs, including Speech and Language Processing (3rd ed. draft) and CS231n, use notation (see page 5) that looks like this:

$$\frac{\partial L}{\partial V} = \frac{\partial L}{\partial a} \frac{\partial a}{\partial z} \frac{\partial z}{\partial V}$$

Here, $L$ is the "loss function" defined above, $a$ and $z$ are vectors of appropriate size, and $V$ is a matrix of parameters, like $W$ discussed above. Overall, this equation was written so that we may find the derivative of $L$ with respect to "the weights" $V$. From this, it seems only logical to assume that we may attempt to get $\frac{\partial L}{\partial W}$ in this manner:

$$\frac{\partial L}{\partial W} = \frac{\partial g}{\partial a} \frac{\partial a}{\partial W}$$

But here is the problem I have with both expressions: on the right-hand side, there is a derivative of a vector with respect to a matrix. From a few sources I know that neither the notation nor the multiplication of this Frankenstein is rigorously defined, if it is defined at all.

Since we typically subtract $\eta \frac{\partial L}{\partial W}$ from $W$ directly, we must be following the denominator-layout notation as explained here, but I can't seem to figure the $\partial a/\partial W$ part out. I have looked over as many blog posts as I could, but virtually none of them actually show how the calculation should be done.

So my questions are as follows.

  1. Is that kind of notation ($\partial a/\partial W$) even legal?
  2. If it is legal, how should it be evaluated and multiplied in this context? Can you give an example?
  3. Is it possible to evaluate $\partial a/\partial W$ without doing it term by term (evaluating $\partial a/\partial W_{11}$, $\partial a/\partial W_{12}$, … and compiling them later in some way)? What about $\partial L/\partial W$?

Thank you in advance.

Best Answer

When tensors get involved (and $\frac{\partial \text{vector}}{\partial\text{matrix}}$ is a 3d tensor), it is useful to keep in mind that the chain rule actually states

$$ D[f\circ g](x_0) = D[f](g(x_0)) \circ D[g](x_0) $$

I.e., the derivative of a composition of functions is the composition of their derivatives. Now, in the context of linear algebra, matrix multiplication "$\cdot$" is essentially the same as the composition "$\circ$" of linear maps, so we tend to write $D[f\circ g] = D[f]\cdot D[g]$ instead. However, tensors can be combined linearly in many different ways, and this notation tends to reach its limits.
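
To make the contraction explicit, here is a small NumPy sketch (my own example, with $a = Wx + b$ and $g(a) = \sum_i a_i^2$): the 3-d tensor $\partial a/\partial W$ has entries $\partial a_i/\partial W_{jk} = \delta_{ij}\,x_k$, and contracting it with $\partial L/\partial a$ collapses to the familiar outer-product formula $\frac{\partial L}{\partial W} = \frac{\partial L}{\partial a}\,x^\top$ in denominator layout.

```python
import numpy as np

# Chain rule as a tensor contraction, for a = W x + b and L = g(a),
# with g(a) = sum(a_i^2) chosen as an example.
rng = np.random.default_rng(0)
m, n = 3, 4
W = rng.normal(size=(m, n))
b = rng.normal(size=m)
x = rng.normal(size=n)

a = W @ x + b
dL_da = 2 * a                                   # gradient of g(a) = sum(a_i^2)

# The 3-d tensor da/dW has entries da_i/dW_jk = delta_ij * x_k.
da_dW = np.einsum('ij,k->ijk', np.eye(m), x)    # shape (m, m, n)

# Chain rule as a contraction: dL/dW_jk = sum_i dL/da_i * da_i/dW_jk.
dL_dW = np.einsum('i,ijk->jk', dL_da, da_dW)    # shape (m, n), same as W

# Because da/dW is so sparse, the contraction collapses to an outer product.
assert np.allclose(dL_dW, np.outer(dL_da, x))
```

The contraction is exactly the composition of derivatives in the displayed chain rule; it is only because $\partial a/\partial W$ is this sparse that the textbook formulas never need to write out the 3-d tensor explicitly.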

I recommend taking a look at http://www.matrixcalculus.org/ (a CAS engine that can help check formulas).

This paper, https://arxiv.org/pdf/1208.0197.pdf, may also interest you.
