Neural Networks – Best Practices for Initial Weights in Neural Networks

Tags: neural networks, normalization

I have just heard that it's a good idea to choose the initial weights of a neural network from the range $(\frac{-1}{\sqrt d} , \frac{1}{\sqrt d})$, where $d$ is the number of inputs to a given neuron. It is assumed that the inputs are normalized to mean 0 and variance 1 (I don't know if this matters).
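
For concreteness, here is a minimal sketch of that initialization rule, assuming NumPy; the layer sizes (100 inputs, 50 neurons) are just placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)

def init_weights(d_in, d_out):
    """Weights drawn uniformly from (-1/sqrt(d_in), 1/sqrt(d_in)),
    where d_in is the number of inputs feeding each neuron."""
    limit = 1.0 / np.sqrt(d_in)
    return rng.uniform(-limit, limit, size=(d_in, d_out))

W = init_weights(100, 50)   # e.g. a layer with 100 inputs and 50 neurons
print(W.min(), W.max())     # every entry lies inside (-0.1, 0.1)
```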

Why is this a good idea?

Best Answer

I assume you are using logistic neurons, and that you are training by gradient descent/back-propagation.

The logistic function is close to flat for large positive or negative inputs. Its derivative at an input of $2$ is about $1/10$, but at $10$ the derivative is about $1/22000$. This means that if the input of a logistic neuron is $10$, then for a given training signal the neuron will learn about $2200$ times more slowly than if the input were $2$.
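
As a quick numerical check of those figures (a sketch in NumPy, using the standard logistic derivative $\sigma'(z) = \sigma(z)(1-\sigma(z))$):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_deriv(z):
    s = sigmoid(z)
    return s * (1.0 - s)

print(sigmoid_deriv(2.0))                        # ~0.105, roughly 1/10
print(sigmoid_deriv(10.0))                       # ~4.5e-5, roughly 1/22000
print(sigmoid_deriv(2.0) / sigmoid_deriv(10.0))  # ~2300, i.e. the slowdown described above
```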

If you want the neuron to learn quickly, you either need to produce a huge training signal (such as with a cross-entropy loss function) or you need the derivative to be large. To make the derivative large, you set the initial weights so that you often get inputs in the range $[-4,4]$.
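
To illustrate both options (a sketch, assuming a sigmoid output paired with binary cross-entropy, a pairing not spelled out in the answer): over $[-4,4]$ the logistic derivative never drops below roughly $0.018$, and with that loss the gradient with respect to the pre-activation is simply $\sigma(z) - y$, so it does not shrink even when $\sigma'(z)$ does:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Logistic derivative across the range [-4, 4].
z = np.linspace(-4.0, 4.0, 9)
deriv = sigmoid(z) * (1.0 - sigmoid(z))
print(deriv.min(), deriv.max())   # ~0.018 at the endpoints, 0.25 at z = 0

# With binary cross-entropy on top of a sigmoid, dL/dz = sigmoid(z) - y,
# so the training signal stays large even when sigmoid'(z) is tiny.
y, z_big = 0.0, 10.0
print(sigmoid(z_big) - y)         # ~1.0: large gradient despite saturation
```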

The initial weights you give might or might not work. It depends on how the inputs are normalized. If the inputs are normalized to have mean $0$ and standard deviation $1$, then a random sum of $d$ terms with weights uniform on $(\frac{-1}{\sqrt{d}},\frac{1}{\sqrt{d}})$ will have mean $0$ and variance $\frac{1}{3}$, independent of $d$: each weight has variance $\frac{1}{3d}$, so the $d$ terms together contribute variance $d \cdot \frac{1}{3d} = \frac{1}{3}$. The probability that you get a sum outside of $[-4,4]$ is small. That means that as you increase $d$, you are not causing the neurons to start out saturated and unable to learn.
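
A quick Monte Carlo check of that claim (a sketch assuming NumPy; the values of $d$ are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

def preactivation_variance(d, n_samples=5_000):
    """Empirical variance of sum_i w_i * x_i with standardized inputs
    x_i ~ N(0, 1) and weights w_i uniform on (-1/sqrt(d), 1/sqrt(d))."""
    x = rng.standard_normal((n_samples, d))
    w = rng.uniform(-1.0 / np.sqrt(d), 1.0 / np.sqrt(d), size=(n_samples, d))
    return np.var((w * x).sum(axis=1))

for d in (10, 100, 1000):
    print(d, preactivation_variance(d))   # all close to 1/3, regardless of d
```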

With inputs that are not normalized, those weights may not be effective at avoiding saturation.
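
As a hypothetical illustration (the raw-feature scale of about $100$ is invented for the example), the same weights applied to unnormalized inputs can push the pre-activation far outside $[-4,4]$:

```python
import numpy as np

rng = np.random.default_rng(0)

d = 100
w = rng.uniform(-1.0 / np.sqrt(d), 1.0 / np.sqrt(d), size=d)

x_standardized = rng.standard_normal(d)          # mean 0, standard deviation 1
x_raw = 100.0 + 5.0 * rng.standard_normal(d)     # unnormalized features centered near 100

print(abs(w @ x_standardized))   # typically well inside [-4, 4]
print(abs(w @ x_raw))            # often tens of units: the logistic neuron starts saturated
```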
