Solved – What values should initial weights for a ReLU network be

neural networks, normalization

For a standard feed-forward Neural Network, what range should my initial weights fall within if I'm planning to use a Rectified Linear Unit (ReLU) as the activation function? A mathematical justification for the recommendation would also be helpful.

I've read this post regarding initialisation of weights; however, it assumes Sigmoid as the activation function. In another post's comments, someone recommends choosing between (0, 0.01) or (0, n**(-0.5)), where "'n' is the number and length of paths from the current layer".

Can anyone confirm these recommendations or suggest better methods?

Best Answer

There has been quite a lot of theoretical work on Neural Network initialization in the last 5 years, which apparently still hasn't propagated to the wider Deep Learning community. While it's true that there still isn't an initialization that works for all architectures and all activation functions (and most likely there never will be, from what we have understood so far about the dynamics of deep neural networks), in practice this isn't a huge limitation, because most users rely on two or three activation functions (ReLU for CNNs; tanh and sigmoid for LSTMs, AKA the only RNNs used by most people) and two or three architectures (e.g., ResNets for image classification and LSTMs for sequence prediction or time series forecasting). For these cases we do have some powerful results. Granted, they won't work for the Universal Transformer just presented at ICML 2018, but frankly, right now there are many more people trying to apply "standard" architectures such as ResNets to interesting business problems, who need better initializations than the Xavier one, than there are people inventing bleeding-edge architectures, who will always have to use the classic, infallible initialization strategy: elbow grease, also known as "long and boring computational experiments, supported by careful book-keeping".

Deep Linear Networks

One of the first main results is the seminal work of Andrew Saxe et al., "Exact solutions to the nonlinear dynamics of learning in deep linear neural networks", 2014, on optimal initialization for deep linear networks. It used results from random matrix theory, and in particular from free probability theory (i.e., the theory of probability for noncommutative random variables, such as, precisely, random matrices), to show that initializing with random orthogonal weights gives far better results than the usual scaled Normal initializations. The theoretical concept behind this result is dynamical isometry: having the product of Jacobians associated with error-signal backpropagation act as a near isometry, up to some overall global $O(1)$ scaling, on a subspace (of the weight space) of as high a dimension as possible. Equivalently, as many singular values as possible of the product of Jacobians should lie within a small range around an $O(1)$ constant; this is closely related to the notion of restricted isometry in compressed sensing and random projections.
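As a concrete illustration (my own sketch, not code from the paper), here is a minimal NumPy implementation of a random orthogonal initializer built from the QR decomposition of a Gaussian matrix; the function name and the optional `gain` argument are my own choices:

```python
import numpy as np

def orthogonal_init(fan_in, fan_out, gain=1.0, rng=None):
    """Sample a (fan_in, fan_out) weight matrix whose smaller dimension is orthonormal."""
    rng = np.random.default_rng() if rng is None else rng
    # QR decomposition of a Gaussian matrix yields a random orthogonal factor Q.
    a = rng.standard_normal((max(fan_in, fan_out), min(fan_in, fan_out)))
    q, r = np.linalg.qr(a)
    # Fix column signs so Q is not biased by the QR convention.
    q *= np.sign(np.diag(r))
    if fan_in < fan_out:
        q = q.T
    return gain * q[:fan_in, :fan_out]

# Example: initialize one layer of a 1000-unit network.
W = orthogonal_init(1000, 1000)
print(np.allclose(W.T @ W, np.eye(1000), atol=1e-6))  # columns are orthonormal
```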

Deep Nonlinear Networks

However, the results from Saxe don't easily translate to "real" NNs with nonlinear activation functions. One can try to apply the random orthogonal matrix initialization anyway, and there are "regimes", such as "the edge of chaos", where it will indeed work incredibly well for an architecture that is otherwise horribly hard to train (a fully connected neural network with 100 layers, 1000 units per layer and tanh activations (!!!)). But outside of this regime there's no guarantee: results can even be worse than with Xavier initialization, so you're back to "try and see".

However, 3 years later, two different papers appeared which extended Saxe's work to nonlinear neural networks by trying to achieve dynamical isometry for nonlinear networks. One is the work by Pennington et al., "Resurrecting the sigmoid in deep learning through dynamical isometry: theory and practice", 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA. This paper fails to find a useful initialization for ReLU networks, while it finds a very good one for tanh networks. Since you're interested in the former, I'll skip this one, just listing it here for completeness.

The other is Balduzzi et al.'s well-known Shattered Gradients paper (Balduzzi et al., "The Shattered Gradients Problem: if resnets are the answer, then what is the question?", arXiv:1702.08591v2, 2018), which proposes to combine Saxe's random orthogonal weights with the so-called Looks-Linear (LL) initialization of a CReLU activation. The approach is as follows (a minimal sketch is given after the list):

  1. substitute all ReLUs with CReLUs, i.e., concatenated rectifiers: these are very similar to a ReLU unit, but instead of

    $$ \sigma(x) = \max(0,x)$$

    we have $$ \boldsymbol{\rho}(x) = (\max(0,x), \max(0,-x)) $$

    (note that $\boldsymbol{\rho}(x):\mathbb{R}\to\mathbb{R}^2$). While you're at it, you will probably want to halve the number of units in each layer, to keep roughly the same number of parameters as before.

  2. now, each layer has a weight matrix $W_l$ with twice as many columns as before (unless you reduced the number of units in each layer to achieve parameter parity with your initial architecture). In any case, because of the CReLU activation function it is a matrix with an even number of columns, so it can be split as $W_l=[W_{1l}, W_{2l}]$ where $W_{1l}$ and $W_{2l}$ have the same shape. Now, for each layer sample a matrix $W'_l$ with orthogonal columns, of the same shape as $W_{1l}$, and initialize $W_l$ as $W^0_l=[W'_l, -W'_l]$. Clearly, at initialization you now have a linear network because

    $$ W^0_l\,\boldsymbol{\rho}(\mathbf{x})=W'_l\sigma(\mathbf{x})-W'_l\sigma(-\mathbf{x})=W'_l\mathbf{x} $$

    which is why we call this initialization LL (looks-linear).

The LL-init can be "extended" easily to CNNs (see the cited paper for details). It does have the disadvantage of forcing you to change your architecture, though it's admittedly a simple change.
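To make the recipe concrete, here is a minimal NumPy sketch (my own, not from the paper) of a CReLU activation plus LL initialization for a single fully connected layer, together with a numerical check of the looks-linear identity above; the names `crelu` and `ll_init` are invented for illustration:

```python
import numpy as np

def crelu(x):
    """Concatenated ReLU: maps R^n -> R^(2n) by stacking max(0, x) and max(0, -x)."""
    return np.concatenate([np.maximum(0.0, x), np.maximum(0.0, -x)], axis=-1)

def ll_init(n_out, n_in, rng=None):
    """Looks-Linear init: W0 = [W', -W'] (shape n_out x 2*n_in), where W' (n_out x n_in)
    has orthonormal rows or columns, whichever dimension is smaller."""
    rng = np.random.default_rng() if rng is None else rng
    a = rng.standard_normal((max(n_out, n_in), min(n_out, n_in)))
    q, _ = np.linalg.qr(a)
    w_prime = q[:n_out, :n_in] if n_out >= n_in else q[:n_in, :n_out].T
    return np.concatenate([w_prime, -w_prime], axis=1), w_prime

# Numerical check of the looks-linear identity  W0 @ crelu(x) == W' @ x  at initialization.
rng = np.random.default_rng(0)
n_in, n_out = 64, 32
W0, w_prime = ll_init(n_out, n_in, rng)          # W0: 32 x 128, w_prime: 32 x 64
x = rng.standard_normal(n_in)
print(np.allclose(W0 @ crelu(x), w_prime @ x))   # True: the layer acts linearly at init
```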

Finally, the most impressive result, based on a mix of theory and experiments, is the Delta-Orthogonal initialization of Xiao et al., "Dynamical Isometry and a Mean Field Theory of CNNs: How to Train 10,000-Layer Vanilla Convolutional Neural Networks", arXiv:1806.05393v2, 2018. This initializer has obtained amazing results, such as allowing successful training of a 10,000-layer vanilla CNN with tanh activations, with hardly any regularization technique (no dropout, no residual connections, no Batch Norm, no weight decay and no learning rate decay: the network relies only on SGD with momentum for regularization). It is even included in TensorFlow, as the ConvolutionOrthogonal initializer in the suite of initialization operators.
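As a rough sketch of the underlying idea (my own reading of the paper, not TensorFlow's implementation): the convolution kernel is zero everywhere except at its spatial center, which holds an orthogonal matrix, so at initialization every convolution acts like an orthogonal $1\times1$ transform. The HWIO kernel layout, the function name and the `gain` argument below are assumptions:

```python
import numpy as np

def delta_orthogonal(kernel_size, c_in, c_out, gain=1.0, rng=None):
    """Sketch of a Delta-Orthogonal conv kernel of shape (H, W, C_in, C_out):
    zero everywhere except the spatial center, which holds an orthogonal matrix.
    Assumes c_out >= c_in, as in the non-decreasing-width setting of the paper."""
    assert c_out >= c_in, "delta-orthogonal assumes non-decreasing channel width"
    rng = np.random.default_rng() if rng is None else rng
    # Random matrix with orthonormal columns, shape (c_out, c_in).
    a = rng.standard_normal((c_out, c_in))
    q, _ = np.linalg.qr(a)
    kernel = np.zeros((kernel_size, kernel_size, c_in, c_out))
    center = kernel_size // 2
    kernel[center, center] = gain * q.T   # (c_in, c_out) block at the spatial center
    return kernel

K = delta_orthogonal(3, 64, 64)
print(K.shape, np.count_nonzero(K[0, 0]), np.count_nonzero(K[1, 1]))  # only the center tap is nonzero
```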

Sadly, this initializer only works its magic on tanh CNNs, while it's not guaranteed to deliver its amazing speedups for ReLU CNNs. And tanh CNNs do suck at object classification: the current SOTA for a tanh CNN on CIFAR-10 is a test error of more than 10%, while with modified ReLU ResNets we go below 3%.
