Solved – Understanding weight distribution in neural network

neural networks

I am training a deep neural network with several convolutional layers and a fully connected layer at the bottom and I am generating histograms of the weight distributions to try and understand how the network is training.

When looking at the graphs, I found something puzzling: most of the weights are near zero and only a small portion of the weights are getting very large. Why is this happening? Is this good and expected, or is this undesirable? Although I have only posted two example layers, this is happening throughout my network.

Additional information:

Data is very sparse and nearly binary (mostly 1's, very few 0's)
Input is normalized to be in the range 0-1
Not using L1/L2 yet since weights are mostly small
Activations are all leaky Relu (a=0.3)
I am performing batch normalization after each preactivation

Best Answer

This is expected. Weights in a CNN form feature detectors, so that a certain pattern in an image is connected to strong weights, but the rest of the image pixels should not cause any activations in the next layer neurons.

Only a small fraction of neurons in a layer is activated every time an image is shown, and a small fraction of weights is needed to be large to activate (or suppress) any particular neuron. Moreover, the number of patterns a network needs to detect is fairly small, especially in the early layers. Therefore, overall connectivity for the network is usually very sparse.

The same reasoning applies to regulatization methods, such as L2/L1 - forcing the weights to be small makes the network more robust to noise in the data, and forces the network to learn only the features present in many images.

Related Solutions

Solved – L2-norms of gradients increasing during training of deep neural network

I guess I got what is a problem with a gradient norm value. Basically negative gradient shows a direction to a local minimum value, but it doesn't say how far it is. For this reason you are able to configure you step proportion. When your weight combination is closer to the minimum value your constant step could be bigger than is necessary and some times it hits in wrong direction and in next epooch network try to solve this problem. Momentum algorithm use modified approach. After each iteration it increases weight update if sign for the gradient the same (by an additional parameter that is added to the $\Delta w$ value). In terms of vectors this addition operation can increase magnitude of the vector and change it direction as well, so you are able to miss perfect step even more. To fix this problem network sometimes needs a bigger vector, because minimum value a little further than in the previous epoch.

To prove that theory I build small experiment. First of all I reproduce the same behaviour but for simpler network architecture with less number of iterations.

import numpy as np
from numpy.linalg import norm
import matplotlib.pyplot as plt
from sklearn.datasets import make_regression
from sklearn import preprocessing
from sklearn.pipeline import Pipeline
from neupy import algorithms

plt.style.use('ggplot')

grad_norm = []
def train_epoch_end_signal(network):
    global grad_norm
    # Get gradient for the last layer
    grad_norm.append(norm(network.gradients[-1]))

data, target = make_regression(n_samples=10000, n_features=50, n_targets=1)

target_scaler = preprocessing.MinMaxScaler()
target = target_scaler.fit_transform(target)

mnet = Pipeline([
    ('scaler', preprocessing.MinMaxScaler()),
    ('momentum', algorithms.Momentum(
        (50, 30, 1),
        step=1e-10,
        show_epoch=1,
        shuffle_data=True,
        verbose=False,
        train_epoch_end_signal=train_epoch_end_signal,
    )),
])

mnet.fit(data, target, momentum__epochs=100)

After training I checked all gradients on plot. Below you can see similar behaviour as yours.

plt.figure(figsize=(12, 8))
plt.plot(grad_norm)
plt.title("Momentum algorithm final layer gradient 2-Norm")
plt.ylabel("Gradient 2-Norm")
plt.xlabel("Epoch")
plt.show()

Also if look closer into the training procedure results after each epoch you will find that errors are vary as well.

plt.figure(figsize=(12, 8))
network = mnet.steps[-1][1]
network.plot_errors()
plt.show()

Next I using almost the same settings create another network, but for this time I select Golden search algorithm for step selection on each epoch.

grad_norm = []
def train_epoch_end_signal(network):
    global grad_norm
    # Get gradient for the last layer
    grad_norm.append(norm(network.gradients[-1]))
    if network.epoch % 20 == 0:
        print("Epoch #{}: step = {}".format(network.epoch, network.step))

mnet = Pipeline([
    ('scaler', preprocessing.MinMaxScaler()),
    ('momentum', algorithms.Momentum(
        (50, 30, 1),
        step=1e-10,
        show_epoch=1,
        shuffle_data=True,
        verbose=False,
        train_epoch_end_signal=train_epoch_end_signal,
        optimizations=[algorithms.LinearSearch]
    )),
])

mnet.fit(data, target, momentum__epochs=100)

Output below shows step variation at each 20 epoch.

Epoch #0: step = 0.5278640466583575
Epoch #20: step = 1.103484809236065e-13
Epoch #40: step = 0.01315561773591515
Epoch #60: step = 0.018180616551587894
Epoch #80: step = 0.00547810271094794

And if you after that training look closer into the results you will find that variation in 2-norm is much smaller

plt.figure(figsize=(12, 8))
plt.plot(grad_norm)
plt.title("Momentum algorithm final layer gradient 2-Norm")
plt.ylabel("Gradient 2-Norm")
plt.xlabel("Epoch")
plt.show()

And also this optimization reduce variation of errors as well

plt.figure(figsize=(12, 8))
network = mnet.steps[-1][1]
network.plot_errors()
plt.show()

As you can see the main problem with gradient is in the step length.

It's important to note that even with a high variation your network can give you improve in your prediction accuracy after each iteration.

Solved – Cannot make this autoencoder network function properly (with convolutional and maxpool layers)

You might gain more insight by visualizing the weights instead of just the reconstructions. I had a similar problem when my biases were misconfigured. Everything below is written based on my experiences writing my own learning library. You can see the code here on Github http://github.com/josephcatrambone/aij.

Here is a screenshot of my program when there are no biases. This is after only maybe ten epochs since I'm in a hurry to finish this writeup:

The weight update is done by these operations:

weights.add_i(positiveProduct.subtract(negativeProduct).elementMultiply(learningRate / (float) batchSize));
//visibleBias.add_i(batch.subtract(negativeVisibleProbabilities).meanRow().elementMultiply(learningRate));
//hiddenBias.add_i(positiveHiddenProbabilities.subtract(negativeHiddenProbabilities).meanRow().elementMultiply(learningRate));

If I uncomment the visible bias code, I get this result:

If I screw up the sign of the visible bias code (subtracting instead of adding):

visibleBias.subtract_i(batch.subtract(negativeVisibleProbabilities).meanRow().elementMultiply(learningRate));

I get this image:

Which snowballs and eventually reaches something like what you have above. Check the signage of your error functions.

Best Answer

Related Solutions

Solved – L2-norms of gradients increasing during training of deep neural network

Solved – Cannot make this autoencoder network function properly (with convolutional and maxpool layers)

Related Question