Neural Networks – Benefits of Using Truncated Normal Distribution for Initializing Weights

backpropagation, neural networks, truncated normal distribution, weights

When initializing connection weights in a feedforward neural network, it is important to initialize them randomly to avoid any symmetries that the learning algorithm would not be able to break.

The recommendation I have seen in various places (e.g. in TensorFlow's MNIST tutorial) is to use a truncated normal distribution with a standard deviation of $\dfrac{1}{\sqrt{N}}$, where $N$ is the number of inputs to the given neuron layer.
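For concreteness, here is a minimal NumPy sketch of that initialization (the 2-standard-deviation cutoff mirrors the one `tf.truncated_normal` uses; the layer sizes are just placeholders):

```python
import numpy as np

def truncated_normal(shape, stddev, cutoff=2.0, rng=None):
    """Draw from N(0, stddev^2), re-drawing any sample whose magnitude
    exceeds `cutoff` standard deviations of the mean."""
    rng = np.random.default_rng() if rng is None else rng
    w = rng.normal(0.0, stddev, size=shape)
    bad = np.abs(w) > cutoff * stddev
    while bad.any():
        w[bad] = rng.normal(0.0, stddev, size=bad.sum())
        bad = np.abs(w) > cutoff * stddev
    return w

# Weights for a layer with N = 784 inputs, using stddev = 1 / sqrt(N)
n_inputs, n_units = 784, 128          # placeholder layer sizes
W = truncated_normal((n_inputs, n_units), stddev=1.0 / np.sqrt(n_inputs))
```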

I believe that the standard deviation formula ensures that backpropagated gradients don't vanish or explode too quickly; as I understand it, with roughly unit-variance, independent inputs it keeps each neuron's pre-activation variance near 1 (see the equation below). But I don't know why we are using a truncated normal distribution as opposed to a regular normal distribution. Is it to avoid rare outlier weights?
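Spelling that out: for zero-mean weights $w_i$ with variance $\frac{1}{N}$ and zero-mean, unit-variance, independent inputs $x_i$,

$$\operatorname{Var}\!\left(\sum_{i=1}^{N} w_i x_i\right) = N \operatorname{Var}(w)\operatorname{Var}(x) = N \cdot \frac{1}{N} \cdot 1 = 1,$$

so the scale of the activations (and of the backpropagated gradients) stays roughly constant from layer to layer.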

Best Answer

I think it's about saturation of the neurons. Suppose you have an activation function like the sigmoid:

[plot of the sigmoid activation function]

If the input to your sigmoid ends up at a value of roughly $2$ or more (or $-2$ or less), the function is nearly flat there, its gradient is close to zero, and the neuron will barely learn. So if you truncate your normal distribution, you avoid drawing the rare large initial weights that push neurons into that saturated region (at least at initialization), given your chosen variance. I think that's why it's better to use the truncated normal in general.
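To put a number on the saturation argument: the sigmoid's gradient $\sigma(z)\bigl(1 - \sigma(z)\bigr)$ peaks at $0.25$ for $z = 0$ and falls off quickly once the input $z$ moves past roughly $\pm 2$, which is why a saturated neuron barely learns. A quick NumPy sketch:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# The sigmoid's gradient sigma(z) * (1 - sigma(z)) peaks at 0.25 for z = 0
# and shrinks rapidly as the input moves into the flat tails.
for z in [0.0, 1.0, 2.0, 4.0, 6.0]:
    s = sigmoid(z)
    print(f"z = {z:4.1f}   sigmoid = {s:.3f}   gradient = {s * (1.0 - s):.4f}")
```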