Solved – Are there any ways to deal with the vanishing gradient for saturating non-linearities that don't involve Batch Normalization or ReLU units

batch-normalization, conv-neural-network, machine-learning, neural-networks

I wanted to train a network with non-linearities that suffer from the vanishing (or exploding) gradient problem, though mainly the vanishing one. I know that the (current) standard way is to use batch normalization [1] (BN) or to simply abandon the non-linearity and use rectified linear units (ReLU).

I wanted two things:

  1. Stick with my non-linearity, so I don't want to abandon it and use ReLU (i.e. no ReLUs allowed!). Re-parametrising the non-linearity is ok, say by putting a multiplicative scalar in front of it, as in $a\theta(s)$ for example.
  2. Ideally, I did not want to rely too much on batch normalization (or at least, if it is used, it has to be used in a novel way other than how it was used in the original paper, or generalized to many non-linearities). One of the reasons I wanted to avoid batch normalization is that it seems to work only for specific non-linearities, for example sigmoids and tanh, and it's unclear how it would work for other non-linearities, say Gaussians.

The reason I have these constraints is that I'd like to deal with the problem of vanishing or exploding gradients by tackling it directly, rather than hacking a solution that works only for specific non-linearities, or just avoiding the problem by shoving in a ReLU.

I was wondering, with those two constraints, what are alternative ways to deal with the vanishing gradient problem? (Other non-linearities under consideration are the Gaussian RBF kernel with a Euclidean-norm pre-activation, the sigmoid, tanh, etc.)

The possible (vague) ideas I had in mind would be:

  1. Have good initialization, so that the saturating non-linearities don't start off already saturated (saturated non-linearities result in gradients close to zero); see the first sketch after this list.
  2. For RBFs, similarly, good initialization might be important, because a Gaussian is only large when its argument is close to 0 (i.e. when the filters are similar to the data). So initializing the filters too far from (or too close to) the data leads to a similar vanishing gradient issue.
  3. I don't really know if this is too constraining, but it would be nice if there were a different way to use batch normalization other than the traditional recipe from the original paper (or some BN idea that generalizes to a bigger set of non-linearities; currently most of the research seems to show it works for sigmoids, as far as I know).
  4. Another idea could be, instead of having the non-linearity $\theta(z)$, to have $a \theta(z)$ where $a \in \mathbb{R}$. If $a > 1$, the factors multiplied together during backpropagation are no longer all smaller than 1 at every layer, so the gradient is less likely to "vanish" for earlier layers. This might make the learning rule unstable, so maybe some regularizer would be a good idea (see the second sketch after this list).
  5. An optimizer that intrinsically deals with the vanishing gradient (or at least updates each parameter differently). For example, if a parameter belongs to a layer closer to the input, its learning step should be larger. It would be nice for the learning algorithm to take this into account by itself, so as to deal with the vanishing gradient (see the third sketch after this list).
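To make ideas 1 and 2 concrete, here is a minimal NumPy sketch (the layer sizes, the Glorot-style scaling and the data-based centre selection are just illustrative choices I'm assuming, not the only option):

```python
# A rough sketch of ideas 1 and 2 (NumPy); sizes and constants are placeholders.
import numpy as np

rng = np.random.default_rng(0)

def init_tanh_weights(fan_in, fan_out):
    # Glorot/Xavier-style uniform init: keeps pre-activations roughly O(1),
    # so tanh does not start out saturated.
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))

def init_rbf_centres(X, n_centres):
    # Data-dependent init: take centres from the training inputs themselves,
    # so ||x - c|| is small for some x and the Gaussian (and its gradient)
    # is not essentially zero everywhere.
    idx = rng.choice(len(X), size=n_centres, replace=False)
    return X[idx].copy()

X = rng.normal(size=(1000, 32))   # toy data
W1 = init_tanh_weights(32, 64)
centres = init_rbf_centres(X, n_centres=16)
```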
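For idea 4, a rough sketch of a learnable scale in front of a tanh, with the forward and backward passes written by hand (the function names are mine, and tanh is just one example of $\theta$):

```python
import numpy as np

def scaled_tanh_forward(s, a):
    # theta_a(s) = a * tanh(s); with a > 1 the backward factor
    # a * (1 - tanh(s)^2) can stay near 1 instead of always being < 1.
    return a * np.tanh(s)

def scaled_tanh_backward(s, a, grad_out):
    t = np.tanh(s)
    grad_s = grad_out * a * (1.0 - t ** 2)   # d/ds [a * tanh(s)]
    grad_a = np.sum(grad_out * t)            # d/da, summed if a is a shared scalar
    return grad_s, grad_a
```

The scale $a$ would be learned along with the weights; a penalty like $\lambda (a - 1)^2$ is one possible form of the regularizer mentioned above.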
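And for idea 5, the crudest possible version of "larger steps for layers closer to the input" would be something like the following (the geometric boost factor is an arbitrary illustrative choice, not something I have tuned):

```python
def per_layer_sgd_step(params, grads, base_lr=0.01, boost=2.0):
    # params/grads: lists of arrays ordered from the input layer to the output
    # layer; earlier layers get a geometrically larger learning rate.
    n = len(params)
    for l, (p, g) in enumerate(zip(params, grads)):
        lr_l = base_lr * boost ** (n - 1 - l)   # largest at layer 0
        p -= lr_l * g                           # in-place NumPy update
    return params
```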

If there are any suggestions on how to deal with vanishing gradients other than batch normalization or ReLUs, I'd love to hear about them!


It seems that the vanishing gradient happens mainly because saturating non-linearities have bounded activations $|\theta(s)| < 1$ and, more importantly, derivatives $|\theta'(s)| < 1$, and after such factors are multiplied many times the gradient either explodes or vanishes. Stating the problem explicitly might help solve it: the issue is that it causes lower layers either not to update or to hinder the signal through the network. It would be nice to maintain this signal flowing through the network during the forward and backward passes, and also during training, not only at initialization.
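To state it in the simplest scalar case (a chain of $L$ layers with pre-activations $s^{(l)} = w^{(l)} \theta(s^{(l-1)}) + b^{(l)}$, which is only a sketch of the general situation):

$$\frac{\partial E}{\partial s^{(1)}} = \frac{\partial E}{\partial s^{(L)}} \prod_{l=2}^{L} w^{(l)}\, \theta'\!\left(s^{(l-1)}\right),$$

so if every factor satisfies $|w^{(l)} \theta'(s^{(l-1)})| < 1$ the product shrinks exponentially with depth, and if every factor exceeds 1 it blows up.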


[1]: Ioffe, S. and Szegedy, C. (2015), "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift", Proceedings of the 32nd International Conference on Machine Learning, Lille, France. JMLR: W&CP volume 37.

Best Answer

Have you looked into RMSProp? Take a look at this set of slides from Geoff Hinton:

Overview of mini-batch gradient descent

Specifically page 29, entitled 'rmsprop: A mini-batch version of rprop', though it's probably worth reading through the full set to get a fuller picture of the related ideas.

Also related is Yann LeCun's No More Pesky Learning Rates

and Brandyn Webb's SMORMS3.

The main idea is to look at the sign of the gradient and whether it's flip-flopping. If the sign is consistent, you want to keep moving in that direction; and if the sign isn't flipping, then whatever step you just took must have been OK, provided it wasn't vanishingly small. So these methods control the step size to keep it sensible, in a way that is largely independent of the actual gradient.
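As a rough illustration (this is a generic NumPy version of the rmsprop update, not code lifted from the slides; the decay and epsilon values are just common defaults), note that the step depends on the ratio of the current gradient to its recent RMS, not on the raw magnitude:

```python
import numpy as np

def rmsprop_step(param, grad, ms, lr=1e-3, decay=0.9, eps=1e-8):
    # ms: running mean of squared gradients for this parameter.
    ms = decay * ms + (1.0 - decay) * grad ** 2
    param = param - lr * grad / (np.sqrt(ms) + eps)   # magnitude-normalised step
    return param, ms
```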

So the short answer to how to handle vanishing or exploding gradients is simply - don't use the gradient's magnitude!