Neural Networks and the Chain Rule

chain rule, machine learning, neural networks, optimization

In neural networks, back-propagation is an implementation of the chain rule. However, the chain rule applies only to differentiable functions; for non-differentiable functions there is no chain rule that works in general. It therefore seems that back-propagation is invalid whenever we use a non-differentiable activation function (e.g. ReLU).
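For concreteness, here is a minimal sketch (assuming PyTorch; other autodiff frameworks make an analogous choice) of what back-propagation actually does at the ReLU kink:

```python
import torch

# ReLU is not differentiable at 0, yet autograd still returns a number there:
# it adopts the convention relu'(0) = 0 (one particular subgradient).
x = torch.tensor(0.0, requires_grad=True)
torch.relu(x).backward()
print(x.grad.item())  # 0.0 -- a chosen convention, not a true derivative
```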

The usual justification offered for this apparent error is that "the chance of hitting a non-differentiable point during learning is practically 0". It is not clear to me, though, that landing exactly on a non-differentiable point during learning is required in order to invalidate the chain rule.

Is there some reason why we should expect back-propagation to yield an estimate of the (sub)gradient? If not, why does training a neural network usually work?

Best Answer

The answer to this question may be clearer now in light of the following two papers:

  1. Kakade and Lee (2018), "Provably Correct Automatic Sub-differentiation for Qualified Programs": https://papers.nips.cc/paper/7943-provably-correct-automatic-sub-differentiation-for-qualified-programs.pdf

  2. Bolte and Pauwels (2019), arXiv:1909.10300: https://arxiv.org/pdf/1909.10300.pdf

As you say, it is wrong to use the chain rule with ReLU activation functions. Moreover, the argument that "the output is differentiable almost everywhere, so the classical chain rule of differentiation applies almost everywhere" is false; see Remark 12 in the second reference.
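A concrete illustration (a minimal sketch assuming PyTorch; the function f is chosen only for this example): f(x) = relu(x) - relu(-x) equals x everywhere, so its true derivative is 1 at every point. Applying the chain rule at x = 0 with the convention relu'(0) = 0 gives 0 instead of 1.

```python
import torch

# f(x) = relu(x) - relu(-x) is identically equal to x, so f'(x) = 1 everywhere.
x = torch.tensor(0.0, requires_grad=True)
f = torch.relu(x) - torch.relu(-x)
f.backward()
print(x.grad.item())  # 0.0, although the true derivative of f at 0 is 1.0
```

At every x ≠ 0 back-propagation does return the correct value 1; the failure is confined to the non-differentiable point, which is exactly what the "measure zero" argument appeals to.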
