Backpropagation – How to Explain the Use of Delta in Backpropagation

backpropagation, machine learning, neural networks

I am going through 15 Steps to Implement a Neural Net and I'm stuck on Step 12, where I am supposed to implement my own backpropagation function. The neural network in question has only an input layer and an output layer, with a single weight matrix between them.

I will go equation by equation and tell you where I get lost:

First, select a random sample.

Now, calculate the net matrix and output matrix using the feed-forward function.

[output, net] = feedforward(random_sample, weights, bias)

Calculate the error vector.

error_vector = target_outputs - outputs

This I understand. We compute the output of the network with its current weights and then look at how much error we have for each of the outputs.
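To make this concrete, here is roughly how I picture these first steps in NumPy (my own minimal sketch, not the tutorial's code; I'm assuming a sigmoid activation and inventing the shapes and names):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def feedforward(sample, W, b):
    """Single-layer forward pass: returns (output, net)."""
    net = W @ sample + b      # pre-activation ("net") values
    output = sigmoid(net)     # activated output
    return output, net

# Toy data: 3 inputs, 2 outputs (made-up numbers just to run the step)
rng = np.random.default_rng(0)
W = rng.normal(size=(2, 3))
b = np.zeros(2)
random_sample = np.array([0.5, -1.0, 2.0])
target_outputs = np.array([1.0, 0.0])

output, net = feedforward(random_sample, W, b)
error_vector = target_outputs - output   # how far off each output is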

Here is where it gets a bit confusing for me:

Calculate the sensitivity.

delta = hadamard(error_vector, activation_diff(net))

The corresponding mathematical expression in the textbook might look like this:

$\delta_k = (t_k - z_k) f'(y_k)$

This is where I get a bit confused. I looked at the Wikipedia article on backpropagation, and delta appears there as well, but the formula there involves both the derivative of the activation function (which my formula has) and the weights (which it does not). Instead of the weights, my formula uses the error vector, and I don't understand why.
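Just so my reading of the step is clear: I take hadamard to mean element-wise multiplication, so with a sigmoid activation the sensitivity step would look roughly like this (again my own sketch with made-up values, not the tutorial's code):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_diff(net):
    # f'(y) for the sigmoid, in terms of the pre-activation values
    s = sigmoid(net)
    return s * (1.0 - s)

# Made-up values standing in for one training step
net = np.array([0.2, -0.7])             # pre-activations y_k
output = sigmoid(net)                   # z_k
target_outputs = np.array([1.0, 0.0])   # t_k
error_vector = target_outputs - output  # (t_k - z_k)

# "hadamard" = element-wise product: delta_k = (t_k - z_k) * f'(y_k)
delta = error_vector * sigmoid_diff(net)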

Can someone explain to me why delta is calculated this way and what its purpose is, that is, what does it represent?

Best Answer

The recursive formula shared on Wikipedia is
$$\delta^{l-1}=(f^{l-1})'\circ (W^l)^T \delta^l$$
This is what propagates the derivatives back toward the earlier layers. However, the initial condition for this recursion is the case $l=L$ ($L$ being the number of layers), and $\delta^L$ has to be calculated separately. If you look a bit further down the page, you'll see that initial condition given as
$$\delta^L=(f^L)'\circ \nabla_{a^L}C$$
where $a^L$ is the activated output of layer $L$ and $C$ is the cost function, following the definitions on the page.

For a squared-error cost such as $C=\tfrac{1}{2}\sum_k (t_k-z_k)^2$, the $k$-th component of $\nabla_{a^L}C$ is $z_k - t_k = -(t_k-z_k)$, which is the $(t_k-z_k)$ in your equation up to sign; the sign differs only because the tutorial you mention adds this quantity instead of subtracting the gradient. So it is essentially the same equation. In your network there is only one weight layer, so the output-layer delta is the only one you ever need, and the weights would only appear if there were earlier layers to propagate it back to via the recursive formula above.
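To make the connection concrete, here is a small sketch (my own NumPy mock-up with sigmoid activations and a squared-error cost, not code from your tutorial) showing the output-layer delta as the initial condition and the recursion where the weights finally appear:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_diff(net):
    s = sigmoid(net)
    return s * (1.0 - s)

# Two weight layers just to show the recursion; your network has only one.
rng = np.random.default_rng(1)
W1 = rng.normal(size=(4, 3))   # input -> hidden
W2 = rng.normal(size=(2, 4))   # hidden -> output
x = np.array([0.5, -1.0, 2.0])
t = np.array([1.0, 0.0])       # targets t_k

# Forward pass
net1 = W1 @ x
a1 = sigmoid(net1)
net2 = W2 @ a1
a2 = sigmoid(net2)             # a2 = z_k, the network output

# Initial condition at the last layer:
# delta^L = f'(net^L) o grad_{a^L} C, with C = 1/2 * sum (t - a^L)^2
grad_C = a2 - t                              # nabla_{a^L} C, i.e. -(t_k - z_k)
delta_L = sigmoid_diff(net2) * grad_C

# Recursion to the previous layer: delta^{l-1} = f'(net^{l-1}) o (W^l)^T delta^l
# This is the only place the weights appear.
delta_hidden = sigmoid_diff(net1) * (W2.T @ delta_L)

# Weight gradients (outer products), e.g. for a gradient-descent update
grad_W2 = np.outer(delta_L, a1)
grad_W1 = np.outer(delta_hidden, x)

With a single weight layer the recursion never runs: delta_L is all you need, and it is exactly your $\delta_k=(t_k-z_k)f'(y_k)$ up to the sign convention discussed above.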
