Connect the gradient $dJ/dW_i$ to $dJ/dW_{i-1}$

Tags: calculus, linear algebra, machine learning, multivariable-calculus, neural networks

TLDR:

Given that this closed form appears to apply only to the weights of the final layer (or to a simple two-layer network), how does one relate it to algorithmically finding the cost gradient with respect to the weights of earlier layers?

The solution found in this online book works well: http://neuralnetworksanddeeplearning.com/chap2.html#the_backpropagation_algorithm

However, the examples there use MSE as the cost function, which does not have the divide-by-zero problem introduced by the derivative of the binary cross-entropy loss. A good mentor suggested adding a small constant to the denominator, which solved my problem in practice, but for the sake of simplicity I am hoping to find a closed form.
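
For reference, a minimal sketch of what that epsilon trick might look like in NumPy (the function name, the placement of the constant, and the value of `eps` are my own assumptions, not my mentor's exact code):

```python
import numpy as np

def bce_grad_wrt_output(z_L, y, eps=1e-12):
    """Gradient of binary cross-entropy w.r.t. the network output z_L.

    The raw derivative is (z_L - y) / (z_L * (1 - z_L)); the small
    constant eps keeps the denominator away from zero if z_L ever
    saturates toward 0 or 1 in floating point.
    """
    return (z_L - y) / (z_L * (1.0 - z_L) + eps)
```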

Mathematical Context:

This answer to a question on backpropagation implies a closed form for the gradient of a cost function with respect to a weight matrix. However, it appears to me that this answer is incorrect in that it covers only a neural network with two layers (input and output) or, at best, only the final layer of a deep neural network.

To illustrate my question, I have altered the answer's notation to add indices that clarify the relationship between one layer and the next. Additionally, the author uses $X$ as a matrix presumably defined by the input of the neural network. I have replaced $X$ with $\mathbf{z}$ to generalize to the input of any layer $i = 1 \dots L$. One could consider $X^T = \mathbf{z}_0$ (when $X$ is in wide data form).

\begin{align}
\mathbf{h}_i &= W_i\mathbf{z}_{i-1} \\
\mathbf{z}_i &= \sigma(\mathbf{h}_i) \\
\sigma(\mathbf{h}_i) &= \frac{1}{1 + e^{-\mathbf{h}_i}}\\
J(W) &= -\mathbf{y}\log(\mathbf{z}_L) - (1 -\mathbf{y})\log(1-\mathbf{z}_L)
\end{align}

Here, $L$ refers to the index of the last layer, so $\mathbf{z}_L$ is the probability output by the neural network.
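
To make the notation concrete for the snippets below, here is a minimal NumPy forward pass matching these equations (the list-of-matrices representation, the column-vector convention, and the function names are my own, not the linked author's):

```python
import numpy as np

def sigmoid(h):
    return 1.0 / (1.0 + np.exp(-h))

def forward(weights, z0):
    """Forward pass for h_i = W_i z_{i-1}, z_i = sigmoid(h_i), i = 1..L.

    weights -- list [W_1, ..., W_L] of weight matrices
    z0      -- input column vector (z_0 above)

    Returns the activations [z_0, ..., z_L] and the pre-activations
    [h_1, ..., h_L], both of which the backward pass needs.
    """
    zs, hs = [z0], []
    for W in weights:
        h = W @ zs[-1]
        hs.append(h)
        zs.append(sigmoid(h))
    return zs, hs
```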

Now, if I wish to compute the gradient of the cost with respect to the weight matrix $W_L$, I can follow the author's logic and use the chain rule:
$$
\frac{\partial{J}}{\partial{W_L}} =
\frac{\partial{J}}{\partial{\mathbf{z}_L}}
\frac{\partial{\mathbf{z}_L}}{\partial{\mathbf{h}_L}}
\frac{\partial{\mathbf{h}_L}}{\partial{W_L}}
$$

which indeed reduces to
$$
\mathbf{z}_{L-1}(\mathbf{z}_L - \mathbf{y})
$$

Or if there are only two layers (only one weight matrix):

$$
\frac{\partial{J(W)}}{\partial{W}} =
\mathbf{X}^T (\mathbf{z}-\mathbf{y})
$$
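
In code, with the column-vector convention from the forward-pass sketch above, this final-layer gradient is an outer product (whether the transpose sits on $\mathbf{z}_{L-1}$ or on the error term depends only on whether the data are stored as rows or columns). A sketch:

```python
def grad_last_layer(zs, y):
    """dJ/dW_L for a sigmoid output with binary cross-entropy loss.

    The sigmoid derivative z_L (1 - z_L) cancels the denominator of
    dJ/dz_L, so the error at the output is simply z_L - y, and the
    weight gradient is its outer product with z_{L-1}.
    """
    delta_L = zs[-1] - y        # dJ/dh_L
    return delta_L @ zs[-2].T   # dJ/dW_L, same shape as W_L
```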

However, this statement does not seem true to me in general, so when writing code I cannot substitute the expression above into the update rule below for an arbitrary weight matrix $W_i$, as the author seems to imply:

$$
W = W -
\alpha \frac{\partial{J(W)}}{\partial{W}}
$$

To show why this doesn't make sense to me, imagine wanting to calculate the gradient with respect to the second-to-last weight matrix $W_{L-1}$. Then the chain rule becomes:

$$
\frac{\partial{J}}{\partial{W_{L-1}}} =
\frac{\partial{J}}{\partial{\mathbf{z}_L}}
\frac{\partial{\mathbf{z}_L}}{\partial{\mathbf{h}_L}}
\frac{\partial{\mathbf{h}_L}}{\partial{\mathbf{z}_{L-1}}}
\frac{\partial{\mathbf{z}_{L-1}}}{\partial \mathbf{h}_{L-1}}
\frac{\partial \mathbf{h}_{L-1}}{\partial W_{L-1}}
$$

As you can see, the chain has grown, and when you compute the individual terms of the product, the final result no longer has the same closed form.
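
For concreteness, here is that longer chain written out in code, under the same conventions as the sketches above (the helper name and index handling are mine):

```python
def grad_second_to_last_layer(weights, zs, y):
    """dJ/dW_{L-1}, term by term from the chain above.

    dJ/dh_L           = z_L - y                   (as before)
    dh_L/dz_{L-1}     = W_L
    dz_{L-1}/dh_{L-1} = z_{L-1} * (1 - z_{L-1})   (elementwise)
    dh_{L-1}/dW_{L-1} = z_{L-2}
    """
    delta_L = zs[-1] - y
    sigma_prime = zs[-2] * (1.0 - zs[-2])                 # sigma'(h_{L-1})
    delta_prev = (weights[-1].T @ delta_L) * sigma_prime  # dJ/dh_{L-1}
    return delta_prev @ zs[-3].T                          # same shape as W_{L-1}
```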

Best Answer

Let me try to address a couple of your concerns:

  1. The divide-by-zero issue.

In practice, this doesn't actually happen. Yes, $dJ/d\hat{y} = (\hat{y} - y)/(\hat{y}(1-\hat{y}))$, but you only end up dividing by zero if $\hat{y} = 1$ or $\hat{y} = 0$, which, as you've pointed out, can't happen, since $\hat{y} = \sigma(\mathbf{h}_L)$ (where $\mathbf{h}_L$ is the weighted input to the final layer) and $\sigma$ has image $(0, 1)$. Moreover, once you multiply by the next chain-rule factor $d\hat{y}/d\mathbf{h}_L = \hat{y}(1-\hat{y})$, the denominator cancels and you are left with $dJ/d\mathbf{h}_L = \hat{y} - y$, so no division ever needs to be carried out.

  2. Finding $dJ/dW_L$ for an arbitrary layer.

It is true that you use the chain rule as you've demonstrated above; however, you don't use $dJ/dW_{L+1}$ to find $dJ/dW_L$, because no term equivalent to $dW_{L+1}/dW_L$ ever shows up in backpropagation (to see why, draw out the computation graph for a small neural network and see how the gradients propagate). What does get passed from one layer to the next is $\partial J/\partial \mathbf{h}$, the error term that the book you linked calls $\delta$. And yes, the final result will no longer have exactly the same closed form.
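
To make this concrete, here is a minimal backward pass in NumPy under the same conventions as the sketches in your question (column vectors, sigmoid at every layer); treat it as an illustration, not as the linked author's implementation. The quantity handed from one layer to the next is $\delta_i = \partial J/\partial\mathbf{h}_i$, never $\partial J/\partial W_{i+1}$, and each weight gradient is formed locally from $\delta_i$ and $\mathbf{z}_{i-1}$. Notice also that the output error starts out as $\mathbf{z}_L - \mathbf{y}$, so the cross-entropy denominator never appears in the code at all.

```python
def backward(weights, zs, y):
    """Backward pass: returns [dJ/dW_1, ..., dJ/dW_L].

    zs must be the activation list [z_0, ..., z_L] produced by the
    forward pass; sigmoid'(h_i) is recovered as z_i * (1 - z_i).
    """
    grads = [None] * len(weights)
    delta = zs[-1] - y                        # dJ/dh_L; no division here
    for i in reversed(range(len(weights))):   # layer L down to layer 1
        grads[i] = delta @ zs[i].T            # dJ/dW for this layer
        if i > 0:
            # propagate the error term, not a weight gradient, one layer back
            delta = (weights[i].T @ delta) * zs[i] * (1.0 - zs[i])
    return grads

# Hypothetical usage with the update rule from the question:
# zs, hs = forward(weights, z0)
# for W, g in zip(weights, backward(weights, zs, y)):
#     W -= alpha * g
```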