Neural Networks – Troubleshooting ReLU Network Failures

backpropagation · deep learning · neural networks · optimization

So I have a problem. Simple situation: a fully-connected multi-layer perceptron with rectified linear (ReLU) units (in both the hidden and output layers), a single hidden layer of 100 units, trained with backprop on MNIST with minibatch_size = 1 and MSE loss (against a 1-hot target vector), initial weights drawn from $\mathcal{N}(0, 0.01)$. I'm doing this because I'm experimenting with variations on this type of network.
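For concreteness, here is a minimal NumPy sketch of that setup (a 784-100-10 network, ReLU on both the hidden and output layers, MSE against the 1-hot target, SGD on single examples). It is not my exact code; it assumes the 0.01 in $\mathcal{N}(0, 0.01)$ is the standard deviation and uses random dummy data in place of MNIST.

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(0.0, 0.01, size=(784, 100))   # input -> hidden
W2 = rng.normal(0.0, 0.01, size=(100, 10))    # hidden -> output
b1, b2 = np.zeros(100), np.zeros(10)

def relu(z):
    return np.maximum(z, 0.0)

def train_step(x, t, eta):
    """One SGD step on a single example (x: shape (784,), t: one-hot, shape (10,))."""
    global W1, W2, b1, b2
    z1 = x @ W1 + b1
    h = relu(z1)                       # hidden activations
    z2 = h @ W2 + b2
    y = relu(z2)                       # ReLU output layer, as described above
    # MSE loss 0.5 * ||y - t||^2, backpropagated through both ReLU layers.
    dz2 = (y - t) * (z2 > 0)           # ReLU gradient is zero wherever z2 <= 0
    dz1 = (dz2 @ W2.T) * (z1 > 0)
    W2 -= eta * np.outer(h, dz2); b2 -= eta * dz2
    W1 -= eta * np.outer(x, dz1); b1 -= eta * dz1
    return 0.5 * np.sum((y - t) ** 2)

# Dummy stand-in for a single MNIST example (pixels in [0, 1], label 3).
x = rng.random(784)
t = np.zeros(10); t[3] = 1.0
print(train_step(x, t, eta=0.05))
```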

Now, if I set the learning rate above some threshold (in my case around 0.06), I get a very strange learning curve (see pictures). Basically, my network starts off ok, then kind of just … gives up.

[Figure: "What goes up..." – learning curves. x-axis: training epoch. y-axis: score on the test (solid) and training (dotted) sets at various points through training. Each colour is a network trained from the same initial weights but with a different learning rate.]

The weights are not exploding to infinity or anything, as the plots below show.

[Figure: final weight plots – weights from two of the above networks after training. Top row: failed network (eta = 0.07); bottom row: successful network (eta = 0.05). Left column: input-to-hidden weights, reshaped to the dimensions of the input images; right column: hidden-to-output weights.]

So it seems that there's some invisible threshold where if you cross it, everything goes to hell. The trouble is I don't understand how to determine the threshold, or what the real cause of failure is. Even changing the random seed can cause the network to train successfully or fail. Does anyone have experience with this phenomenon or know of a paper that examines/explains/proposes a solution for this problem?

EDIT

So, I've identified the problem, but have not yet found the solution. The problem is that output units will sometimes get pushed into a regime where they are "dead" – that is, not responsive to any input. Once they're dead, they're dead – all gradients from that unit become zero, and therefore they cannot learn to be useful again. They can more easily be pushed into this regime when the learning rate is higher. An obvious solution is to use a softmax output layer, but that isn't applicable to my particular problem.
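To make the failure mode concrete, here is a small illustrative NumPy sketch (toy numbers, not my actual training code): a single ReLU output unit with MSE loss, fed non-negative hidden activations, takes one oversized SGD step that drives its pre-activation below zero on every input, after which its gradient is zero everywhere and it never recovers.

```python
import numpy as np

rng = np.random.default_rng(0)
H = rng.random((1000, 100))                  # hidden ReLU activations: all non-negative
w = rng.normal(0.0, 0.01, size=100)          # weights into one ReLU output unit
b = 0.0

def dead_fraction(H, w, b):
    """Fraction of inputs on which the unit's pre-activation is <= 0."""
    return np.mean(H @ w + b <= 0.0)

print("before:", dead_fraction(H, w, b))

# Take an input on which the unit fires, with target 0, and apply one SGD step
# on the MSE loss with a deliberately oversized learning rate.
i = np.argmax(H @ w + b)
h = H[i]
y, t = max(h @ w + b, 0.0), 0.0
eta = 5.0                                    # exaggerated for illustration
w -= eta * (y - t) * h                       # dL/dw = (y - t) * relu'(z) * h, relu'(z) = 1 here
b -= eta * (y - t)

print("after:", dead_fraction(H, w, b))      # typically 1.0: the unit is now dead
# With dead_fraction == 1.0, relu'(z) = 0 on every example, so every gradient
# through this unit is zero and no update can ever bring it back.
```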

Best Answer

The problem is that output units will sometimes get pushed into a regime where they are "dead" – that is, not responsive to any input. Once they're dead, they're dead – all gradients from that unit become zero, and therefore they cannot learn to be useful again.

This is a well-known problem with ReLU units. As workarounds, some folks have designed alternative activation functions which are largely similar, but do not have a flat gradient. The Leaky ReLU function $L$ is probably the simplest:

$$ L(x) = \begin{cases} x &\text{if}~x > 0\\ \alpha x &\text{otherwise} \end{cases} $$

where $0 < \alpha < 1$ is a constant chosen by the user, usually something like $\alpha = 0.2$. This always has a positive gradient, so the weights can always update.
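For reference, a minimal NumPy sketch of this activation and its derivative (taking the gradient at $x = 0$ to be $\alpha$, which is just one common convention):

```python
import numpy as np

def leaky_relu(z, alpha=0.2):
    """L(x) = x if x > 0, alpha * x otherwise."""
    return np.where(z > 0, z, alpha * z)

def leaky_relu_grad(z, alpha=0.2):
    """Derivative of L: 1 if x > 0, alpha otherwise (alpha taken at x = 0)."""
    return np.where(z > 0, 1.0, alpha)

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(leaky_relu(z))        # [-0.4 -0.1  0.   0.5  2. ]
print(leaky_relu_grad(z))   # [0.2 0.2 0.2 1.  1. ]
```

Swapping this in for the plain ReLU on the output layer means a "dead" unit still receives a (scaled) gradient and can move back into the active regime.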

They can more easily be pushed into this regime when the learning rate is higher.

This thread addresses your observation about larger learning rates.
