Solved – Why do SGD and backpropagation work with ReLUs

backpropagation, deep learning, neural networks

ReLUs are not differentiable at the origin. Nevertheless, they are widely used in deep learning together with stochastic gradient descent (SGD) and backpropagation, where the gradients of the loss function are computed via the chain rule.

How do these algorithms compute derivatives given that ReLUs are not differentiable at x = 0?

Best Answer

At x = 0 the ReLU function is not differentiable, but it is sub-differentiable, and any value in the interval [0, 1] is a valid choice of subgradient. Many implementations simply use a subgradient of 0 at the x = 0 singularity. For further details, see the Wikipedia article on the subderivative.
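
To make the choice concrete, here is a minimal NumPy sketch of a ReLU forward pass and its backward-pass subgradient, picking 0 at x = 0 as described above. The function names `relu` and `relu_subgradient` are illustrative, not from any particular library.

```python
import numpy as np

def relu(x):
    # Forward pass: elementwise max(0, x).
    return np.maximum(0.0, x)

def relu_subgradient(x):
    # Backward pass: 1 where x > 0, 0 where x < 0.
    # At x == 0 the subdifferential is the whole interval [0, 1];
    # here we pick 0, the common implementation choice.
    return (x > 0).astype(x.dtype)

x = np.array([-2.0, 0.0, 3.0])
print(relu(x))              # [0. 0. 3.]
print(relu_subgradient(x))  # [0. 0. 1.]
```

Any fixed value in [0, 1] at x = 0 would also be a valid subgradient; in practice the input is exactly 0 so rarely that the choice has negligible effect on training.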