I have just heard that it's a good idea to choose the initial weights of a neural network from the range $(\frac{-1}{\sqrt d}, \frac{1}{\sqrt d})$, where $d$ is the number of inputs to a given neuron. It is assumed that the inputs are normalized to mean 0 and variance 1 (I don't know if this matters).
Why is this a good idea?
Best Answer
I assume you are using logistic neurons, and that you are training by gradient descent/back-propagation.
The logistic function is close to flat for large positive or negative inputs. Its derivative at an input of $2$ is about $1/10$, but at $10$ the derivative is about $1/22000$. This means that if the input of a logistic neuron is $10$, then for a given training signal the neuron will learn about $2200$ times slower than if the input was $2$.
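These numbers are easy to check numerically. A minimal sketch (the function names are my own) that evaluates the derivative of the logistic function, $\sigma'(x) = \sigma(x)(1-\sigma(x))$, at the two inputs mentioned above:

```python
import math

def logistic(x):
    """The logistic (sigmoid) function."""
    return 1.0 / (1.0 + math.exp(-x))

def logistic_deriv(x):
    """Derivative of the logistic function: sigma(x) * (1 - sigma(x))."""
    s = logistic(x)
    return s * (1.0 - s)

print(logistic_deriv(2))   # about 0.105, i.e. roughly 1/10
print(logistic_deriv(10))  # about 4.5e-5, i.e. roughly 1/22000
print(logistic_deriv(2) / logistic_deriv(10))  # roughly 2300x slowdown
```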
If you want the neuron to learn quickly, you either need a huge training signal (such as the one produced by a cross-entropy loss function) or you need the derivative to be large. To make the derivative large, you set the initial weights so that you often get inputs in the range $[-4,4]$.
The initial weights you give might or might not work. It depends on how the inputs are normalized. If the inputs are normalized to have mean $0$ and standard deviation $1$, then the weighted sum of $d$ inputs with weights uniform on $(\frac{-1}{\sqrt{d}},\frac{1}{\sqrt{d}})$ will have mean $0$ and variance $\frac{1}{3}$, independent of $d$: each weight has variance $\frac{(2/\sqrt{d})^2}{12} = \frac{1}{3d}$, so each term has variance $\frac{1}{3d}$, and the $d$ terms together have variance $\frac{1}{3}$. Since the standard deviation of the sum is about $0.58$, the probability that you get a sum outside of $[-4,4]$ is tiny. That means as you increase $d$, you are not causing the neurons to start out saturated so that they don't learn.
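A quick Monte Carlo check of this claim (a sketch, with a hypothetical helper name): for several values of $d$, draw weights uniform on $(-1/\sqrt{d}, 1/\sqrt{d})$ and standard normal inputs, and estimate the variance of the weighted sum. It should come out near $1/3$ regardless of $d$.

```python
import random

def preactivation_variance(d, trials=30_000):
    """Empirical variance of sum_i w_i * x_i, where x_i ~ N(0, 1)
    and w_i ~ Uniform(-1/sqrt(d), 1/sqrt(d))."""
    bound = 1.0 / d ** 0.5
    sums = [
        sum(random.uniform(-bound, bound) * random.gauss(0, 1)
            for _ in range(d))
        for _ in range(trials)
    ]
    mean = sum(sums) / trials
    return sum((s - mean) ** 2 for s in sums) / trials

for d in (5, 50):
    # Both should print a value close to 1/3, independent of d.
    print(d, preactivation_variance(d))
```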
With inputs that are not normalized, those weights may not be effective at avoiding saturation.