[Math] Derivatives on hidden layers in backpropagation (ANNs)

derivatives, machine learning, neural networks

I'm working on understanding all the math used in artificial neural networks. I have gotten stuck at calculating the error function derivatives for hidden layers when performing backpropagation.

On page 244 of Bishop's "Pattern Recognition and Machine Learning", formula 5.55 gives the derivative of the error function with respect to a hidden unit as a sum of derivatives over all units to which that unit sends connections:

$$ \frac{\partial E_n}{\partial a_j} = \sum_k \frac{\partial E_n}{\partial a_k} \frac{\partial a_k}{\partial a_j}$$

I know the chain rule. If $a_j$ feeds into only one other node, we can apply the chain rule to split the derivative into the two factors. But what is the intuition behind summing these terms over all nodes when the output of unit $j$ goes into multiple nodes?
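For example, if unit $j$ sends connections only to units $k = 1$ and $k = 2$, the formula reads

$$ \frac{\partial E_n}{\partial a_j} = \frac{\partial E_n}{\partial a_1} \frac{\partial a_1}{\partial a_j} + \frac{\partial E_n}{\partial a_2} \frac{\partial a_2}{\partial a_j}$$

Why do the two downstream contributions simply add up?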

Thanks

Best Answer

Thinking in terms of directional derivatives might be more instructive in this case, as we can arrive at the chain rule formulation in a constructive fashion.

Let us consider the directional derivative of a multivariate function $f : \mathbb{R}^n \rightarrow \mathbb{R}$ in an arbitrary direction $\textbf{u} \in \mathbb{R}^n$. Since $\textbf{u}$ is a direction, we shall assume $\|\textbf{u}\| = 1$. The directional derivative in the direction of $\textbf{u}$ is then defined as

$$\nabla_u f = \nabla f \cdot \textbf{u}$$

This can be proved by noting that, under the usual limit definition of the directional derivative, we have

$$\nabla_u f (\textbf{x}) = \lim_{h\rightarrow0} \frac{f(\textbf{x} + h\textbf{u}) - f(\textbf{x})}{h }$$

Since we assume the function $f$ is differentiable, around any point $\textbf{a}$ we can form a linear approximation of $f$ that is close to the true value of $f(\textbf{x})$ in a small neighborhood of $\textbf{a}$: $$f(\textbf{x}) \approx f(\textbf{a}) + \nabla f(\textbf{a})^T (\textbf{x} - \textbf{a})$$ with an approximation error that vanishes faster than $\|\textbf{x} - \textbf{a}\|$.

Plugging this approximation into the limit above (the error term disappears as $h \rightarrow 0$), we obtain, for any $\textbf{a}$,

$$\nabla_u f (\textbf{a}) = \nabla f(\textbf{a}) \cdot \textbf{u}$$
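As a quick numerical sanity check of this identity, one can compare the limit definition against $\nabla f(\textbf{a}) \cdot \textbf{u}$ directly; the test function, point, and direction below are arbitrary choices of mine, not anything taken from Bishop.

```python
import numpy as np

# Arbitrary smooth test function f : R^3 -> R and its gradient (my own choice).
def f(x):
    return np.sin(x[0]) * x[1] ** 2 + x[2]

def grad_f(x):
    return np.array([np.cos(x[0]) * x[1] ** 2,
                     2.0 * np.sin(x[0]) * x[1],
                     1.0])

a = np.array([0.3, -1.2, 0.7])           # point at which we differentiate
u = np.array([2.0, 1.0, -1.0])
u = u / np.linalg.norm(u)                # direction, so ||u|| = 1

h = 1e-6
finite_diff = (f(a + h * u) - f(a)) / h  # limit definition with a small h
dot_product = grad_f(a) @ u              # gradient dotted with the direction

print(finite_diff, dot_product)          # the two values agree to about 1e-6
```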

Going back to the original problem: when differentiating the cost function $E_n$ with respect to a hidden-unit activation $a_j$, we also need to take into account the perturbations that a change in $a_j$ induces in every other variable it interacts with. Specifically, when altering $a_j$ we also move along the $\textbf{direction}$ of the perturbed variables.

This direction encapsulates all the changes caused in the intermediate variables $a_k$ when a certain $a_j$ is altered. As such, for every $a_k$ that is directly influenced by $a_j$, we can measure the actual change by evaluating $\frac{\partial a_k}{\partial a_j}$.

Let $\textbf{p}$ be this direction vector; its components are

$$\textbf{p}_k = \bigg(\frac{\partial a_k}{\partial a_j}\bigg)$$

where $k$ runs over all variables $a_k$ that are directly influenced by $a_j$.
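To make $\textbf{p}$ concrete: in Bishop's feed-forward setting (as far as I can tell from the surrounding equations), the next layer's pre-activations are $a_k = \sum_j w_{kj}\, h(a_j)$ for an activation function $h$, so the components become

$$\textbf{p}_k = \frac{\partial a_k}{\partial a_j} = w_{kj}\, h'(a_j)$$

which is what turns the general sum below into the familiar backpropagation rule $\frac{\partial E_n}{\partial a_j} = h'(a_j) \sum_k w_{kj} \frac{\partial E_n}{\partial a_k}$.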

Coupling the directional derivative with this perturbation vector $\textbf{p}$, we arrive at the desired result, namely

$$\frac{\partial E_n}{\partial a_j} = \nabla E_n ^T \cdot \textbf p = \sum_{k \in \mathcal{S}} \frac{\partial E_n}{\partial a_k} \frac{\partial a_k}{\partial a_j}$$

where $\mathcal{S}$ is the set of all indices corresponding to variables directly influenced by $a_j$.
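Finally, here is a small numerical sketch of exactly this sum on a toy hidden layer; the layer sizes, the tanh activation, the linear outputs and the squared error below are assumptions of mine for illustration only, not anything fixed by the question.

```python
import numpy as np

# Toy fragment of a network: hidden pre-activations a_j feed tanh units,
# which feed linear outputs with a squared error (all choices are mine).
rng = np.random.default_rng(0)
W = rng.normal(size=(3, 2))        # weights w_kj from hidden unit j to output unit k
t = rng.normal(size=3)             # targets
a = rng.normal(size=2)             # hidden pre-activations a_j

def error(a):
    z = np.tanh(a)                 # hidden activations z_j = h(a_j)
    a_out = W @ z                  # output pre-activations a_k = sum_j w_kj z_j
    return 0.5 * np.sum((a_out - t) ** 2)

# The sum over downstream units: dE/da_j = sum_k dE/da_k * da_k/da_j
#                                        = sum_k (a_k - t_k) * w_kj * h'(a_j)
z = np.tanh(a)
dE_da_k = W @ z - t                # dE/da_k for linear outputs and squared error
backprop = (W.T @ dE_da_k) * (1 - z ** 2)

# Finite-difference check of the same derivatives, one hidden unit at a time
h = 1e-6
numeric = np.array([(error(a + h * e) - error(a)) / h for e in np.eye(2)])

print(backprop)
print(numeric)                     # matches the backprop sum to about 1e-5
```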
