What are the theoretical/practical reasons to use the normal distribution to initialize the weights in Neural Networks

deep learning · machine learning · neural networks · normal distribution · weights

I'm aware that there are many different practices for initializing the weights when training a neural network, and it seems that, traditionally, the standard normal distribution is the first choice. Most articles I found argue there are better ways to initialize the weights than a normal distribution, but they did not explain why a normal distribution would at least work.

(1) I think restricting the weights to have mean 0 and standard deviation 1 keeps the weights as small as possible, which makes regularization convenient. Am I understanding this correctly?

(2) On the other hand, what are the theoretical/practical reasons to use the normal distribution? Why not sample the random weights from some other arbitrary distribution? Is it because the normal distribution has the maximum entropy for a given mean and variance? Having maximum entropy means it is the most "chaotic" choice and thus makes the fewest assumptions about the weights. Am I understanding this correctly?

Best Answer

(1) I think restricting the weights to have mean 0 and standard deviation 1 keeps the weights as small as possible, which makes regularization convenient. Am I understanding this correctly?

No. If you wanted the weights to be as small as possible, you would set them all to 0 (which would itself be a poor initialization, since identical weights can never break symmetry during training). A standard deviation of 1 does not make the weights especially small.

(2) On the other hand, what are the theoretical/practical reasons to use the normal distribution? Why not sample the random weights from some other arbitrary distribution? Is it because the normal distribution has the maximum entropy for a given mean and variance? Having maximum entropy means it is the most "chaotic" choice and thus makes the fewest assumptions about the weights. Am I understanding this correctly?

I don't think there is much deep logic in that decision, besides perhaps the fact that the Gaussian distribution is a good "default prior", since many quantities follow a Gaussian distribution. In fact, one popular default initialization scheme, by Glorot et al., prescribes a uniform distribution, not a normal distribution.
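To make that concrete, here is a minimal NumPy sketch of the two Glorot variants (the function names, shapes, and seed are my own illustration, not from the answer or the paper). The uniform limit sqrt(6 / (fan_in + fan_out)) is chosen precisely so that both distributions end up with the same variance, 2 / (fan_in + fan_out):

```python
import numpy as np

rng = np.random.default_rng(0)

def glorot_uniform(fan_in, fan_out):
    # Glorot & Bengio (2010): W ~ U(-limit, +limit),
    # limit = sqrt(6 / (fan_in + fan_out))
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_out, fan_in))

def glorot_normal(fan_in, fan_out):
    # Normal variant with the same variance, since
    # Var[U(-a, a)] = a^2 / 3 = 2 / (fan_in + fan_out)
    sigma = np.sqrt(2.0 / (fan_in + fan_out))
    return rng.normal(0.0, sigma, size=(fan_out, fan_in))

print(glorot_uniform(256, 128).var())  # ~ 2 / 384 ≈ 0.0052
print(glorot_normal(256, 128).var())   # ~ the same
```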

What probably happens in practice is: (1) authors provide a theoretical justification for what the variance of the initial weight distribution should be; (2) they pick some convenient distribution with that variance. The normal distribution is then a very natural and easy choice.
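A toy sketch of that "variance first, distribution second" point, assuming the Glorot variance as the prescribed target (the Rademacher example is my own illustration): any zero-mean distribution rescaled to the target variance satisfies the prescription equally well.

```python
import numpy as np

rng = np.random.default_rng(0)
fan_in, fan_out = 256, 128
target_var = 2.0 / (fan_in + fan_out)  # e.g. the Glorot variance

# Normal: just set sigma = sqrt(target_var).
normal = rng.normal(0.0, np.sqrt(target_var), size=(fan_out, fan_in))

# Uniform: Var[U(-a, a)] = a^2 / 3, so pick a = sqrt(3 * target_var).
a = np.sqrt(3.0 * target_var)
uniform = rng.uniform(-a, a, size=(fan_out, fan_in))

# Even scaled coin flips (Rademacher) match the prescribed variance.
flips = rng.choice([-1.0, 1.0], size=(fan_out, fan_in)) * np.sqrt(target_var)

for name, W in [("normal", normal), ("uniform", uniform), ("rademacher", flips)]:
    print(name, W.mean().round(4), W.var().round(5))  # means ~0, variances ~target_var
```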
