Solved – How to initialize the elements of the filter matrix

conv-neural-network, deep learning, feature-engineering, machine learning, neural networks

I'm trying to better understand convolutional neural networks by writing up Python code that doesn't depend on libraries (like Convnet or TensorFlow), and I'm getting stuck in the literature on how to choose values for the kernel matrix when performing a convolution on an image.

I'm trying to understand the implementation details in the step between feature maps in the image below showing the layers of a CNN.

Convolutional neural network layers

According to this diagram:

Convolving an image

The kernel (or filter) matrix "steps" over the image, creating a feature map where each pixel is the sum of the element-wise products between each weight of the kernel and the corresponding pixel value of the input image.
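For concreteness, a minimal, library-free sketch of that stepping might look like the following (stride 1, no padding, no kernel flip as is usual in CNNs; the function name is just for illustration):

    def convolve2d(image, kernel):
        # image and kernel are plain lists of lists of numbers
        ih, iw = len(image), len(image[0])
        kh, kw = len(kernel), len(kernel[0])
        out_h, out_w = ih - kh + 1, iw - kw + 1
        feature_map = [[0.0] * out_w for _ in range(out_h)]
        for i in range(out_h):
            for j in range(out_w):
                # sum of element-wise products between the kernel and the
                # image patch currently underneath it
                total = 0.0
                for a in range(kh):
                    for b in range(kw):
                        total += kernel[a][b] * image[i + a][j + b]
                feature_map[i][j] = total
        return feature_map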

My question is: how do we initialize the weights of the kernel (or filter) matrix?

In the demonstration above, they are simply 1s and 0s, but I assume this is simplified for the diagram's sake.

Are these weights trained in some preprocessing step? Or chosen explicitly by the user?

Best Answer

One typically initializes a network's weights from a random distribution, usually with mean zero and some care taken in choosing its variance. These days, with advances in optimization techniques (SGD with momentum, among other methods) and activation nonlinearities (ReLUs and ReLU-like activations allow for better backpropagation of gradient signals, even in deeper networks), one can actually train state-of-the-art convolutional neural networks from a randomized initialization.
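As a minimal sketch (not prescriptive; the 3x3 shape and the 0.01 standard deviation are only illustrative "small" values), such a random initialization for a single convolution kernel could look like:

    import numpy as np

    rng = np.random.default_rng(0)
    # one 3x3 kernel drawn from a mean-zero Gaussian before any training
    kernel = rng.normal(loc=0.0, scale=0.01, size=(3, 3))
    bias = 0.0  # biases are commonly started at zero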

Key properties are the following:

  • Why random? Why not initialize them all to 0? An important concept here is called symmetry breaking. If all the neurons have the same weights, they will produce the same outputs, and during the backpropagation step all of their weight updates will be exactly the same, so we will never learn different features. Starting from a random distribution initializes the neurons to be different (with very high probability) and allows us to learn a rich and diverse feature hierarchy.

  • Why mean zero? A common practice in machine learning is to zero-center or normalize the input data, such that the raw input features (for image data these would be pixels) average to zero.

    We zero-center our data, and we randomly initialize our network's weights (the matrices you referred to). What sort of distribution should we choose? The distribution of the inputs to our network has mean zero, since we zero-centered them. Say we also initialize our bias terms to zero. When we begin training, we have no reason to favor one neuron over another, as they are all random. One practice is to initialize the weights so that each neuron's activation output is zero in expectation. This way, no neuron is more likely to "activate" (have a positive output value) than any other, while symmetry is still broken by the random initialization. A simple way to accomplish this is to choose a mean-zero distribution.

  • How do we choose the variance? You don't want the variance to be too large, even if the mean is zero. Extreme values in a deep net's weights can produce activation outputs that grow exponentially in magnitude, and this issue compounds with the depth of the network, which can wreak havoc on training. You also don't want the variance to be too small, as that can slow down learning because we end up computing very small gradient values. So there is a balance to strike, especially for deeper networks, where we do not want the forward or backward signals to grow or shrink exponentially with depth.

    There are two very popular weight initialization schemes: the Glorot uniform initializer (Understanding the Difficulty of Training Deep Feedforward Neural Networks) and the He normal initializer (Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification).

    They are both constructed with the intent of training deep networks with the following core principle in mind (the quote is from the Delving Deep into Rectifiers article):

    "A proper initialization method should avoid reducing or magnifying the magnitudes of input signals exponentially."

    Roughly speaking, these two schemes choose the variance of each layer's weights so that the distribution of each neuron's output stays roughly the same from layer to layer. Section 2.2 of the Delving Deep into Rectifiers paper provides an in-depth analysis; a rough sketch of both schemes in code follows this list.
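Here is that sketch, applied to a convolution kernel of shape (out_channels, in_channels, kh, kw). The fan-in/fan-out convention and the helper names are my own choices for illustration; deep learning libraries differ slightly in the details, but the formulas follow the two papers:

    import numpy as np

    def glorot_uniform(shape, rng=None):
        rng = np.random.default_rng() if rng is None else rng
        out_c, in_c, kh, kw = shape
        fan_in, fan_out = in_c * kh * kw, out_c * kh * kw
        # uniform on [-limit, limit] gives Var = 2 / (fan_in + fan_out)
        limit = np.sqrt(6.0 / (fan_in + fan_out))
        return rng.uniform(-limit, limit, size=shape)

    def he_normal(shape, rng=None):
        rng = np.random.default_rng() if rng is None else rng
        out_c, in_c, kh, kw = shape
        fan_in = in_c * kh * kw
        # Var = 2 / fan_in, derived for ReLU activations
        std = np.sqrt(2.0 / fan_in)
        return rng.normal(0.0, std, size=shape)

    # e.g. 16 filters, 3 input channels, 3x3 kernels
    weights = he_normal((16, 3, 3, 3))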

A final note: sometimes you will also see people use a Gaussian with standard deviation equal to 0.005 or 0.01, or some other "small" standard deviation, across all the layers. Other times you will see people tune the variances by hand, essentially performing cross validation to find the best-performing configuration.