Backpropagation – Why Is Weights Delta Calculated as Matrix Multiplication of Outputs and Delta?

backpropagation, machine-learning, neural-networks, python

I am going through 15 Steps to Implement a Neural Net and I'm stuck on Step 12, where I have to implement my own backpropagation function. The neural network in question has only an input layer and an output layer, with a single weight matrix between them.

I was able to get this to work, but I had to modify one equation (line of code). I will go equation by equation and tell you where I had to modify things:

First, select a random sample.

Now, calculate the net matrix and output matrix using the feed-forward
function.

[output, net] = feedforward(random_sample, weights, bias)

Calculate the error vector

error_vector = target_outputs - outputs

This I understand. We are calculating the output of our current neural network with the current weights and then looking at how much error we have for each of the outputs.
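For concreteness, here is a minimal NumPy sketch of what these first steps could look like. The sigmoid activation, the shapes (a 1 x n_in row sample, an n_in x n_out weight matrix), and all the concrete numbers are my own assumptions for illustration, not the guide's actual code:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def feedforward(random_sample, weights, bias):
    # net: pre-activations (1 x n_out), output: activations (1 x n_out)
    net = random_sample @ weights + bias
    output = sigmoid(net)
    return output, net

rng = np.random.default_rng(0)
random_sample = rng.normal(size=(1, 3))      # one randomly selected sample, 1 x n_in
weights = rng.normal(size=(3, 2))            # n_in x n_out
bias = rng.normal(size=(1, 2))

output, net = feedforward(random_sample, weights, bias)
target_outputs = np.array([[1.0, 0.0]])
error_vector = target_outputs - output       # one error term per output unit
```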

Then the guide goes on:

Calculate the sensitivity.

delta = hammard(error_vector, activation_diff(net))

The corresponding mathematical expression in the textbook might look
like this:

$\delta_k = (t_k - z_k)\,f'(y_k)$

I addressed this in a separate question, as things were a bit unclear to me here as well.
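In code, the Hadamard product is just element-wise multiplication, so this step reduces to a `*` between two NumPy arrays. Here is a self-contained sketch with made-up values, assuming a sigmoid activation (so that $f'(y) = f(y)(1 - f(y))$); none of these names come from the guide itself:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_diff(net):
    # derivative of the sigmoid evaluated at the pre-activations
    s = sigmoid(net)
    return s * (1.0 - s)

net = np.array([[0.5, -1.2]])             # pre-activations, 1 x n_out
error_vector = np.array([[0.3, -0.1]])    # t_k - z_k, 1 x n_out
delta = error_vector * sigmoid_diff(net)  # delta_k = (t_k - z_k) f'(y_k)
```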

Finally, the guide says the following:

Calculate the weights delta:

weights_delta = scalar_mul(eta, kronecker(transpose(outputs), delta))

The corresponding mathematical expression in the textbook might look
like this: $\Delta w_{kj} = \eta\,(t_k - z_k)\,f'(y_k)\,y_j$

Update the weights:

weights = add(weights, weights_delta)

and return the matrix.

I found the formula:

weights_delta = scalar_mul(eta, kronecker(transpose(outputs), delta))

to be incorrect. What I implemented (and what seems to work) is:

weights_delta = learning_rate * np.matmul(np.concatenate([random_sample, biases], axis=1).T, delta)

So I'm matrix multiplying the transposed inputs with delta, not Kronecker multiplying the transposed outputs and delta.
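To make the dimension argument explicit, here is a small shape check of the modified line. The sizes (n_in = 3, n_out = 2) are arbitrary, and I am assuming that biases is a constant 1 x 1 input column, so that the bias row of the augmented weight matrix gets its own update:

```python
import numpy as np

n_in, n_out = 3, 2
learning_rate = 0.1
rng = np.random.default_rng(0)

random_sample = rng.normal(size=(1, n_in))    # 1 x n_in
biases = np.ones((1, 1))                      # assumed constant bias input, 1 x 1
delta = rng.normal(size=(1, n_out))           # 1 x n_out

inputs = np.concatenate([random_sample, biases], axis=1)     # 1 x (n_in + 1)
weights_delta = learning_rate * np.matmul(inputs.T, delta)   # (n_in + 1) x n_out

# Same shape as an (inputs x outputs) weight matrix with a bias row,
# so it can be added to the weights directly.
print(weights_delta.shape)    # (4, 2)
```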

My question is: Have I changed the weights_delta = scalar_mul(eta, kronecker(transpose(outputs), delta)) line correctly? Does it make sense? If yes, can someone explain the reasoning behind it, i.e., why the first line is faulty and the second one makes sense? When I did it, I was mostly just looking at the matrix dimensions and playing with them until they matched.

Best Answer

You are correct: the weight gradients are computed by multiplying the deltas (the gradients w.r.t. the pre-activations) with the inputs.

To see why this is the case, you can simply compute the derivative of the pre-activations (i.e. the values right before the activation function is applied) w.r.t. the weights:

$$\frac{\partial s_i}{\partial w_{jk}} = \frac{\partial}{\partial w_{jk}}\bigg(\sum_l w_{il} x_l\bigg) = [i = j]\,x_k,$$

where $[i=j]$ is the Iverson bracket. After all, this is what you will find in the chain rule that is used to compute the gradients:

$$\frac{\partial L}{\partial w_{ij}} = \sum_a \underbrace{\frac{\partial L}{\partial s_a}}_{\delta_a} \underbrace{\frac{\partial s_a}{\partial w_{ij}}}_{[i=a]\,x_j} = \delta_i x_j.$$
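If you want to convince yourself numerically, a finite-difference check of this result could look like the sketch below. It assumes a squared-error loss $L = \tfrac{1}{2}\lVert t - f(Wx)\rVert^2$ with a sigmoid $f$, a weight matrix of shape n_out x n_in, and a column input $x$; these choices are mine and not part of the original question:

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def loss(W, x, t):
    # squared-error loss of a single-layer net with sigmoid activation
    z = sigmoid(W @ x)
    return 0.5 * np.sum((t - z) ** 2)

rng = np.random.default_rng(0)
n_in, n_out = 3, 2
W = rng.normal(size=(n_out, n_in))
x = rng.normal(size=(n_in, 1))
t = rng.normal(size=(n_out, 1))

# Analytic gradient: dL/dW = delta @ x.T, with delta_a = dL/ds_a
# (for this particular loss, delta_a = -(t_a - z_a) f'(s_a))
s = W @ x
z = sigmoid(s)
delta = -(t - z) * z * (1.0 - z)
grad_analytic = delta @ x.T          # outer product: entry (i, j) is delta_i * x_j

# Finite-difference gradient, entry by entry
eps = 1e-6
grad_numeric = np.zeros_like(W)
for i in range(n_out):
    for j in range(n_in):
        Wp, Wm = W.copy(), W.copy()
        Wp[i, j] += eps
        Wm[i, j] -= eps
        grad_numeric[i, j] = (loss(Wp, x, t) - loss(Wm, x, t)) / (2 * eps)

print(np.max(np.abs(grad_analytic - grad_numeric)))   # ~1e-10: the two agree
```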