Backpropagation – Why Is Weights Delta Calculated as Matrix Multiplication of Outputs and Delta?

backpropagation, machine-learning, neural-networks, python

I am going through 15 Steps to Implement a Neural Net and I'm stuck on Step 12, where I have to implement my own backpropagation function. The neural network in question has only an input layer and an output layer, with a single weight matrix between them.

I was able to get this to work, but I had to modify one equation (line of code). I will go equation by equation and tell you where I had to modify things:

First, select a random sample.

Now, calculate the net matrix and output matrix using the feed-forward
function.

[output, net] = feedforward(random_sample, weights, bias)

Calculate the error vector

error_vector = target_outputs - outputs

This I understand. We are calculating the output of our current neural network with the current weights and then looking at how much error we have for each of the outputs.
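For concreteness, here is a minimal NumPy sketch of what these first steps could look like. The sigmoid activation, the shapes (a 1 x n_in row sample, an n_in x n_out weight matrix), and all the concrete numbers are my own assumptions for illustration, not the guide's actual code:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def feedforward(random_sample, weights, bias):
    # net: pre-activations (1 x n_out), output: activations (1 x n_out)
    net = random_sample @ weights + bias
    output = sigmoid(net)
    return output, net

rng = np.random.default_rng(0)
random_sample = rng.normal(size=(1, 3))      # one randomly selected sample, 1 x n_in
weights = rng.normal(size=(3, 2))            # n_in x n_out
bias = rng.normal(size=(1, 2))

output, net = feedforward(random_sample, weights, bias)
target_outputs = np.array([[1.0, 0.0]])
error_vector = target_outputs - output       # one error term per output unit
```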

Then the guide goes on:

Calculate the sensitivity.

delta = hammard(error_vector, activation_diff(net))

The corresponding mathematical expression in the textbook might look
like this:

$\delta_k = (t_k - z_k)\,f'(y_k)$

I addressed this in a separate question, as things were a bit unclear to me here as well.
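In code, the Hadamard product is just element-wise multiplication, so this step reduces to a `*` between two NumPy arrays. Here is a self-contained sketch with made-up values, assuming a sigmoid activation (so that $f'(y) = f(y)(1 - f(y))$); none of these names come from the guide itself:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_diff(net):
    # derivative of the sigmoid evaluated at the pre-activations
    s = sigmoid(net)
    return s * (1.0 - s)

net = np.array([[0.5, -1.2]])             # pre-activations, 1 x n_out
error_vector = np.array([[0.3, -0.1]])    # t_k - z_k, 1 x n_out
delta = error_vector * sigmoid_diff(net)  # delta_k = (t_k - z_k) f'(y_k)
```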

Finally, the guide says the following:

Calculate the weights delta:

weights_delta = scalar_mul(eta, kronecker(transpose(outputs), delta))

The corresponding mathematical expression in the textbook might look
like this: $\Delta w_{kj} = \eta\,(t_k - z_k)\,f'(y_k)\,y_j$

Update the weights:

weights = add(weights, weights_delta)

and return the matrix.

I found the formula:

weights_delta = scalar_mul(eta, kronecker(transpose(outputs), delta))

to be incorrect. What I implemented (and what seems to work) is:

weights_delta = learning_rate * np.matmul(np.concatenate([random_sample, biases], axis=1).T, delta)

So I'm matrix multiplying the transposed inputs with delta, not Kronecker multiplying the transposed outputs and delta.
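To make the dimension argument explicit, here is a small shape check of the modified line. The sizes (n_in = 3, n_out = 2) are arbitrary, and I am assuming that biases is a constant 1 x 1 input column, so that the bias row of the augmented weight matrix gets its own update:

```python
import numpy as np

n_in, n_out = 3, 2
learning_rate = 0.1
rng = np.random.default_rng(0)

random_sample = rng.normal(size=(1, n_in))    # 1 x n_in
biases = np.ones((1, 1))                      # assumed constant bias input, 1 x 1
delta = rng.normal(size=(1, n_out))           # 1 x n_out

inputs = np.concatenate([random_sample, biases], axis=1)     # 1 x (n_in + 1)
weights_delta = learning_rate * np.matmul(inputs.T, delta)   # (n_in + 1) x n_out

# Same shape as an (inputs x outputs) weight matrix with a bias row,
# so it can be added to the weights directly.
print(weights_delta.shape)    # (4, 2)
```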

My question is: Have I changed the weights_delta = scalar_mul(eta, kronecker(transpose(outputs), delta)) line correctly? Does it make sense? If yes, can someone explain the reasoning behind it, i.e., why the first line is faulty and the second one makes sense? When I did it, I was mostly just looking at the matrix dimensions and playing with them until they matched.

Best Answer

You are correct: the weight gradients are computed by multiplying the deltas (the gradients w.r.t. the pre-activations) with the inputs.

To see why this is the case, you can simply compute the derivative of the pre-activations (i.e. the values right before the activation function is applied) w.r.t. the weights:

$$\frac{\partial s_i}{\partial w_{jk}} = \frac{\partial}{\partial w_{jk}}\bigg(\sum_l w_{il} x_l\bigg) = [i = j]\,x_k,$$

where $[i=j]$ is the Iverson bracket. After all, this is what you will find in the chain rule that is used to compute the gradients:

$$\frac{\partial L}{\partial w_{ij}} = \sum_a \underbrace{\frac{\partial L}{\partial s_a}}_{\delta_a} \underbrace{\frac{\partial s_a}{\partial w_{ij}}}_{[i=a]\,x_j} = \delta_i x_j.$$
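If you want to convince yourself numerically, a finite-difference check of this result could look like the sketch below. It assumes a squared-error loss $L = \tfrac{1}{2}\lVert t - f(Wx)\rVert^2$ with a sigmoid $f$, a weight matrix of shape n_out x n_in, and a column input $x$; these choices are mine and not part of the original question:

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def loss(W, x, t):
    # squared-error loss of a single-layer net with sigmoid activation
    z = sigmoid(W @ x)
    return 0.5 * np.sum((t - z) ** 2)

rng = np.random.default_rng(0)
n_in, n_out = 3, 2
W = rng.normal(size=(n_out, n_in))
x = rng.normal(size=(n_in, 1))
t = rng.normal(size=(n_out, 1))

# Analytic gradient: dL/dW = delta @ x.T, with delta_a = dL/ds_a
# (for this particular loss, delta_a = -(t_a - z_a) f'(s_a))
s = W @ x
z = sigmoid(s)
delta = -(t - z) * z * (1.0 - z)
grad_analytic = delta @ x.T          # outer product: entry (i, j) is delta_i * x_j

# Finite-difference gradient, entry by entry
eps = 1e-6
grad_numeric = np.zeros_like(W)
for i in range(n_out):
    for j in range(n_in):
        Wp, Wm = W.copy(), W.copy()
        Wp[i, j] += eps
        Wm[i, j] -= eps
        grad_numeric[i, j] = (loss(Wp, x, t) - loss(Wm, x, t)) / (2 * eps)

print(np.max(np.abs(grad_analytic - grad_numeric)))   # ~1e-10: the two agree
```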