Solved – Understanding weight distribution in neural network

neural networks

I am training a deep neural network with several convolutional layers followed by a fully connected layer at the end, and I am generating histograms of the weight distributions to understand how the network is training.

When looking at the graphs, I found something puzzling: most of the weights stay near zero, and only a small fraction of the weights become very large. Why is this happening? Is this good and expected, or is it undesirable? Although I have only posted two example layers, this is happening throughout my network.

Weight distribution histograms
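For reference, per-layer histograms like these can be generated with a short helper. This is a minimal sketch assuming a Keras model; the `model` argument is a placeholder and not taken from the question:

```python
import matplotlib.pyplot as plt

def plot_weight_histograms(model, bins=100):
    """Plot a histogram of the kernel weights for every parameterized layer."""
    for layer in model.layers:
        weights = layer.get_weights()      # list of numpy arrays (kernel, bias, ...)
        if not weights:
            continue                       # skip layers with no trainable parameters
        kernel = weights[0].ravel()        # flatten the kernel for the histogram
        plt.figure()
        plt.hist(kernel, bins=bins)
        plt.title(f"Weight distribution: {layer.name}")
        plt.xlabel("weight value")
        plt.ylabel("count")
        plt.show()
```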

Additional information:

  • Data is very sparse and nearly binary (mostly 0's, very few 1's)
  • Input is normalized to the range 0-1
  • Not using L1/L2 regularization yet, since the weights are already mostly small
  • All activations are leaky ReLU (a = 0.3)
  • I am applying batch normalization to each pre-activation (see the sketch after this list)
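To make the layer ordering concrete, here is one way such a block could look. This is only a sketch of the description above, assuming Keras; the filter count and kernel size are illustrative and not from the question:

```python
from tensorflow.keras import layers

def conv_block(x, filters):
    # Convolution produces the pre-activation; the bias is redundant with batch norm.
    x = layers.Conv2D(filters, kernel_size=3, padding="same", use_bias=False)(x)
    x = layers.BatchNormalization()(x)   # normalize the pre-activation
    x = layers.LeakyReLU(0.3)(x)         # leaky ReLU with slope 0.3 for negative inputs
    return x
```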

Best Answer

This is expected. Weights in a CNN act as feature detectors: a particular pattern in the image is matched by strong weights, while the rest of the image's pixels should produce little or no activation in the next layer's neurons.

Only a small fraction of the neurons in a layer is activated each time an image is shown, and only a small fraction of the weights needs to be large to activate (or suppress) any particular neuron. Moreover, the number of patterns a network needs to detect is fairly small, especially in the early layers. As a result, the effective connectivity of the network is usually very sparse.
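If you want to quantify this, you can measure the fraction of weights per layer whose magnitude exceeds some cutoff. A minimal sketch assuming a Keras model; the threshold value is arbitrary and only for illustration:

```python
import numpy as np

def large_weight_fraction(model, threshold=0.1):
    """Fraction of kernel weights per layer whose magnitude exceeds `threshold`."""
    fractions = {}
    for layer in model.layers:
        weights = layer.get_weights()
        if not weights:
            continue
        kernel = weights[0]
        fractions[layer.name] = float(np.mean(np.abs(kernel) > threshold))
    return fractions
```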

The same reasoning applies to regularization methods such as L1/L2: forcing the weights to be small makes the network more robust to noise in the data and pushes it to learn only the features that appear in many images.
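If you do decide to add an explicit penalty later, Keras lets you attach L1/L2 regularization directly to a layer's kernel. The penalty strengths below are illustrative assumptions, not recommendations:

```python
from tensorflow.keras import layers, regularizers

# Example: a conv layer with both an L1 and an L2 penalty on its kernel weights.
conv = layers.Conv2D(
    32, kernel_size=3, padding="same",
    kernel_regularizer=regularizers.l1_l2(l1=1e-5, l2=1e-4),
)
```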