Solved – Back propagation and various cost functions

algorithms, backpropagation, loss-functions, machine-learning

I want to understand and extract a codeable back propagation algorithm. I'm mostly a pure coder, ignorant of the theory, and my math skills are very weak. What I want to end up with is an extensible abstraction of the back propagation algorithm, and as deep an understanding of its internals as possible.


Most tutorials start from the fact that the example uses the mean squared error (quadratic) cost function:

1.
$$
E = \tfrac{1}{2} \sum_{k \in \mathrm{Outputs}} (t_k - o_k)^2
$$
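For a concrete feel (numbers are mine, purely illustrative): with two output neurons, targets $t = (1, 0)$ and outputs $o = (0.8, 0.3)$, this gives

$$
E = \tfrac{1}{2}\left[(1 - 0.8)^2 + (0 - 0.3)^2\right] = \tfrac{1}{2}(0.04 + 0.09) = 0.065
$$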

After declaring that fact, the tutorials jump to the derivation, and the gradient descent formula is defined as:

2.
$$
\Delta w_{i,j} = -\eta \frac {\partial E}{\partial w_{i,j}}
$$

After a few transformations, during which I believed I could trace the logic but then lost the narrative thread, we arrive at algorithm steps close to:

3.
For the output layer
$$
\delta_j = -2\alpha\, o_j(1 - o_j)(t_j - o_j)
$$

4.
For a hidden layer
$$
\delta_j = 2\alpha\, o_j(1 - o_j) \sum_{k \in \mathrm{Children}(j)} \delta_k w_{j,k}
$$

Where:

  • $\alpha$ – Learning rate
  • $o_j$ – Output of a neuron, i.e. the result of applying the activation function to the neuron's input: $f(x_{j}) = o_j$
  • $t_j$ – Desired output for the neuron
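
To make steps 2–4 concrete, here is a minimal sketch of one training step in Python for a single-hidden-layer network, assuming a sigmoid activation (which is where the $o_j(1 - o_j)$ factor comes from). The constant $2$, the sign, and the learning rate from steps 3 and 4 are folded into the conventional form $\delta_j = o_j(1 - o_j)(t_j - o_j)$; all names are my own, not from any particular tutorial.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_step(x, t, W1, W2, eta=0.1):
    """One forward/backward pass for a 1-hidden-layer net (steps 2-4).

    x  : input vector,  shape (n_in,)
    t  : target vector, shape (n_out,)
    W1 : hidden-layer weights, shape (n_hidden, n_in)
    W2 : output-layer weights, shape (n_out, n_hidden)
    """
    # Forward pass: o_j = f(x_j), with f = sigmoid
    h = sigmoid(W1 @ x)              # hidden outputs o_j
    o = sigmoid(W2 @ h)              # network outputs o_k

    # Step 3 (output layer): delta_k = o_k (1 - o_k) (t_k - o_k)
    delta_out = o * (1 - o) * (t - o)

    # Step 4 (hidden layer): delta_j = o_j (1 - o_j) * sum_k delta_k w_{j,k}
    delta_hid = h * (1 - h) * (W2.T @ delta_out)

    # Step 2: gradient descent, Delta w = -eta * dE/dw = eta * delta * input
    W2 += eta * np.outer(delta_out, h)
    W1 += eta * np.outer(delta_hid, x)
    return W1, W2
```

Note that the cost function appears only through the $(t_k - o_k)$ factor in `delta_out`; the answer below generalizes exactly that observation.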

I am confused about the connection between 3 and 4 (the algorithm steps) and 1 (the cost function). I already received a comment on Stack Overflow which says that 3 and 4 hold for all cost functions. But there is a variety of different cost functions, and 3 and 4 don't use $E$ directly.

  • Then how is it possible that 3 and 4 hold for every cost function?
  • If 3 and 4 hold for every cost function, how do these steps use the result of applying the cost function?
  • How do you describe the algorithm using a different cost function, for example the Hellinger distance?

Best Answer

The update policy is:

$$ w = w - \alpha \nabla C(t, \sigma, w) $$

where $C(t, \sigma, w)$ is the cost function, $t$ is the vector of target values, $w$ represents the weights on the incoming edges, and $\sigma$ is the activation function.

Let's assume, for the sake of simplicity, that we are dealing with a simple neural network with one hidden layer and a single output unit.

Let's also assume that the cost function is the Hellinger distance you asked about, in the form commonly used as a cost:

$$C(x, w) = \frac{1}{\sqrt 2} \sum_{i=1}^{m} (\sqrt{\sigma(w; x_i)} - \sqrt{t_i})^2$$
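
As a code-level sanity check, here is a minimal sketch of evaluating this cost (the naming is my own), given the network outputs for a batch of $m$ samples:

```python
import numpy as np

def hellinger_cost(outputs, targets):
    """C = 1/sqrt(2) * sum_i (sqrt(o_i) - sqrt(t_i))^2.

    outputs : network outputs sigma(w; x_i), assumed in (0, 1)
    targets : target values t_i, assumed non-negative
    """
    return np.sum((np.sqrt(outputs) - np.sqrt(targets)) ** 2) / np.sqrt(2)
```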

In the calculation of the gradient for the $i$th element in the sample, the gradient with respect to the $j$th weight would be (by the chain rule; the $2$ from the square and the $\tfrac{1}{2}$ from the derivative of the square root cancel):

$$\frac{\partial C(x,w)}{\partial w_j} = \frac{1}{\sqrt 2} \left(\sqrt{\sigma(w; x_i)} - \sqrt{t_i}\right) \sigma(w; x_i)^{-1/2} \, \frac{\partial \sigma (x,w)}{\partial w_j}$$

The next step would be to define an activation function and carry on with the calculations.
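For example, taking $\sigma$ to be the logistic sigmoid applied to $w \cdot x_i$ (an assumption made here purely for illustration), we have $\frac{\partial \sigma(x,w)}{\partial w_j} = \sigma(w; x_i)\left(1 - \sigma(w; x_i)\right) x_{ij}$, and the gradient above simplifies to:

$$\frac{\partial C(x,w)}{\partial w_j} = \frac{1}{\sqrt 2} \left(\sqrt{\sigma(w; x_i)} - \sqrt{t_i}\right) \sqrt{\sigma(w; x_i)} \left(1 - \sigma(w; x_i)\right) x_{ij}$$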

In light of all of this:

If 3 and 4 hold for every cost function, how do these steps use the result of applying the cost function?

We follow the steps above. The cost function plays a direct role in the calculation of the gradient.

Then how is it possible that 3 and 4 hold for every cost function?

There is a sleight of hand which I have not used so far, because I felt it was the source of the confusion: it is easier to consider the output $o$ of your output node to be the entire $\sqrt{\sigma(w; x_i)}$.

From the point of view of notation this is very convenient, because you abstract away from any specific cost function. If you exploit this notation and plug the partial derivative into the update policy, you recover your general statement for the update rule.

This generalization is also very convenient at the software implementation level: it constitutes a higher-level abstraction.
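
To sketch what that higher-level abstraction might look like in code (my own design, not a standard API): the output delta of step 3 factors into a cost-derivative part and an activation-derivative part, so swapping the cost function only swaps the first factor, while the shape of steps 3 and 4 stays fixed.

```python
import numpy as np

def sigmoid_prime_from_output(o):
    """Sigmoid derivative written in terms of its output: f'(x) = o (1 - o)."""
    return o * (1 - o)

# Each cost supplies only its derivative with respect to the output o;
# the backpropagation machinery itself never changes.
def mse_grad(o, t):
    """dE/do for E = 1/2 * sum (t - o)^2."""
    return o - t

def hellinger_grad(o, t):
    """dC/do for C = 1/sqrt(2) * sum (sqrt(o) - sqrt(t))^2."""
    return (np.sqrt(o) - np.sqrt(t)) / (np.sqrt(2) * np.sqrt(o))

def output_delta(o, t, cost_grad):
    """Generalized step 3: (dC/do) * (do/dx), for any plugged-in cost."""
    return cost_grad(o, t) * sigmoid_prime_from_output(o)
```

Here `output_delta` is the derivative of the cost with respect to the neuron's net input, so the descent update is $w \leftarrow w - \eta\, \delta \cdot \text{input}$, matching formula 2. The hidden-layer rule (step 4) never touches the cost at all: it only consumes the deltas from the layer above, which is exactly why it holds unchanged for every cost function.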