When initializing connection weights in a feedforward neural network, it is important to initialize them randomly to avoid any symmetries that the learning algorithm would not be able to break.
The recommendation I have seen in various places (e.g. in TensorFlow's MNIST tutorial) is to use a truncated normal distribution with a standard deviation of $\dfrac{1}{\sqrt{N}}$, where $N$ is the number of inputs to the given layer.
I believe that the standard deviation formula ensures that backpropagated gradients don't vanish or blow up too quickly. But I don't know why we use a truncated normal distribution as opposed to a regular normal distribution. Is it to avoid rare outlier weights?
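For concreteness, here is a minimal sketch of the initialization described above, in plain Python. The helper names (`truncated_normal`, `init_weights`) and the layer sizes are made up for illustration. The cutoff of 2 standard deviations is an assumption chosen to mirror TensorFlow's truncated normal, which redraws samples that fall further than that from the mean.

```python
import math
import random


def truncated_normal(stddev, rng, limit=2.0):
    """Sample from N(0, stddev^2), redrawing anything beyond limit * stddev.

    The limit of 2 standard deviations is an assumption for this sketch.
    """
    while True:
        x = rng.gauss(0.0, stddev)
        if abs(x) <= limit * stddev:
            return x


def init_weights(n_inputs, n_outputs, seed=0):
    """Build an n_inputs x n_outputs weight matrix with stddev 1/sqrt(n_inputs)."""
    rng = random.Random(seed)
    stddev = 1.0 / math.sqrt(n_inputs)
    return [
        [truncated_normal(stddev, rng) for _ in range(n_outputs)]
        for _ in range(n_inputs)
    ]


# Hypothetical layer: 784 inputs (a 28x28 image) feeding 10 units.
weights = init_weights(784, 10)
bound = 2.0 / math.sqrt(784)
largest = max(abs(w) for row in weights for w in row)
print(f"largest |w| = {largest:.4f}, bound = {bound:.4f}")
```

With 784 inputs the standard deviation is 1/28, so no weight can exceed 2/28 (about 0.071) in magnitude, whereas an untruncated normal would occasionally produce larger values.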
Best Answer
I think it's about saturation of the neurons. Suppose you have an activation function like the sigmoid.
If a weight ends up with a large magnitude (say >= 2 or <= -2), the neuron's input can land in the flat region of the sigmoid, where the gradient is close to zero, and the neuron will barely learn. If you truncate the normal distribution, you avoid this issue (at least at initialization), because the largest possible weight is bounded by a multiple of the standard deviation. I think that's why it's better to use a truncated normal in general.
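To illustrate the saturation argument, here is a small sketch that evaluates the derivative of the sigmoid at a few pre-activation values. The specific values of `z` are arbitrary choices for illustration, not thresholds taken from any library.

```python
import math


def sigmoid(z):
    """Logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_grad(z):
    """Derivative of the sigmoid: s(z) * (1 - s(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)


# The gradient peaks at z = 0 and shrinks as |z| grows.
for z in (0.0, 2.0, 5.0, 10.0):
    print(f"z = {z:5.1f}  sigmoid'(z) = {sigmoid_grad(z):.6f}")
```

The derivative is 0.25 at zero and falls off quickly from there, so a neuron whose weighted input starts far from zero receives only a tiny gradient signal.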