Fixup Initialization – How It Prevents Neurons from Updating in the Same Direction

convolution, machine learning, neural networks

We know zero initialization is bad:

Pitfall: all-zero initialization. Let's start with what we should not do. Note that we do not know what the final value of every weight should be in the trained network, but with proper data normalization it is reasonable to assume that approximately half of the weights will be positive and half of them will be negative. A reasonable-sounding idea then might be to set all the initial weights to zero, which we expect to be the "best guess" in expectation. This turns out to be a mistake, because if every neuron in the network computes the same output, then they will also all compute the same gradients during backpropagation and undergo the exact same parameter updates. In other words, there is no source of asymmetry between neurons if their weights are initialized to be the same.
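To make the symmetry concrete, here is a minimal sketch (PyTorch assumed; layer sizes and data are made up) using a constant initialization. Zero is just the special case where the gradients additionally vanish through the zero weights; a nonzero constant makes the identical-gradient problem easier to see:

    import torch
    import torch.nn as nn

    torch.manual_seed(0)

    # Two-layer network with every parameter set to the same constant.
    net = nn.Sequential(nn.Linear(4, 3), nn.Tanh(), nn.Linear(3, 1))
    for p in net.parameters():
        nn.init.constant_(p, 0.5)

    x = torch.randn(8, 4)
    y = torch.randn(8, 1)
    loss = nn.functional.mse_loss(net(x), y)
    loss.backward()

    # All three rows are identical: every hidden neuron receives the same
    # gradient, so after any number of updates the neurons remain exact copies.
    print(net[0].weight.grad)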

However, in OpenAI's "Improved Denoising Diffusion Probabilistic Models", we see the following code inside a ResBlock class:

    self.out_layers = nn.Sequential(
        normalization(self.out_channels),
        SiLU(),
        # Normally 0
        nn.Dropout(p=dropout),
        zero_module(
            conv_nd(dims, self.out_channels, self.out_channels, 3, padding=1)
        ),
    )

where zero_module initializes the weights of a module to 0.
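Such a helper can be written in a few lines; here is a sketch (the exact code in the repository may differ in details):

    def zero_module(module):
        # Zero out all parameters of a module in place and return it.
        for p in module.parameters():
            p.detach().zero_()
        return module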
The authors are using a trick from Fixup Initialization:

Fixup initialization (or: How to train a deep residual network without normalization)

  1. Initialize the classification layer and the last layer of each residual branch to 0.
  2. Initialize every other layer using a standard method (e.g., Kaiming He), and scale only the weight layers inside residual branches by … .
  3. Add a scalar multiplier (initialized at 1) in every branch and a scalar bias (initialized at 0) before each convolution, linear, and element-wise activation layer (see the sketch after this list).
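For concreteness, here is a rough sketch of what rules 1 and 3 would look like for a basic residual block (PyTorch assumed; rule 2's scaling factor is elided in the quote above, so it is omitted here as well, and this is just my reading of the rules rather than the paper's reference code):

    import torch
    import torch.nn as nn

    class FixupStyleBlock(nn.Module):
        def __init__(self, channels):
            super().__init__()
            # Rule 3: scalar biases (init 0) before each conv / activation,
            # plus a scalar multiplier (init 1) on the branch.
            self.bias1 = nn.Parameter(torch.zeros(1))
            self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
            self.bias2 = nn.Parameter(torch.zeros(1))
            self.relu = nn.ReLU()
            self.bias3 = nn.Parameter(torch.zeros(1))
            self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
            self.scale = nn.Parameter(torch.ones(1))
            # Rule 1: the last layer of the residual branch starts at zero.
            nn.init.zeros_(self.conv2.weight)

        def forward(self, x):
            out = self.conv1(x + self.bias1)
            out = self.conv2(self.relu(out + self.bias2) + self.bias3)
            return x + self.scale * out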

I don't understand how this trick works. It seems like the "there is no asymmetry" problem would still exist, since the weights of that last layer are all initialized to 0. Adding a scalar multiplier or bias before/after the convolution doesn't seem to help.

This is a near-duplicate of this question: Fixup initialisation for residual networks, which was answered, but the answer provided no useful information.

Best Answer

Constant initialization is a problem for fully connected layers because every neuron is connected to every input: each neuron receives exactly the same input data, so if all the weights are the same, every neuron computes exactly the same output and therefore receives exactly the same gradient.

Convolutional layers are not fully connected: the locality of convolutional filters means that different "neurons" (different output positions) see different patches of the input. So even if the weights are identical, the outputs can differ.
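A quick way to see this (PyTorch assumed; sizes are arbitrary): a convolution whose weights are all equal still produces different values at different output positions, because each position sees a different patch of the input.

    import torch
    import torch.nn as nn

    torch.manual_seed(0)

    # A single 3x3 filter with every weight set to the same constant.
    conv = nn.Conv2d(1, 1, kernel_size=3, padding=1, bias=False)
    nn.init.constant_(conv.weight, 0.1)

    x = torch.randn(1, 1, 8, 8)
    out = conv(x)

    # Nonzero spread: identical weights, but each output position is a
    # weighted sum of a different 3x3 input patch.
    print(out.std())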

Residual blocks return $x + f(x)$ instead of just $f(x)$. With a plain convolutional layer, zero initialization would produce an all-zero output. In a residual block, however, zero-initializing the last layer of the branch makes $f(x) = 0$, so the block starts out as the identity function, which preserves the asymmetry already present in the input. The zero-initialized layer still receives nonzero gradients during training, so it does not stay at zero.
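Here is a minimal sketch of that point (PyTorch assumed; the branch is simplified from the ResBlock above). With the last convolution of the branch zero-initialized, the block is exactly the identity at the start, yet that convolution still gets nonzero gradients, because its weight gradient depends on its inputs and on the upstream gradient rather than on its own (zero) weights.

    import torch
    import torch.nn as nn

    torch.manual_seed(0)

    channels = 4
    branch = nn.Sequential(
        nn.Conv2d(channels, channels, 3, padding=1),   # standard random init
        nn.SiLU(),
        nn.Conv2d(channels, channels, 3, padding=1),   # last layer of the branch
    )
    nn.init.zeros_(branch[-1].weight)                  # zero_module-style init
    nn.init.zeros_(branch[-1].bias)

    x = torch.randn(2, channels, 8, 8)
    out = x + branch(x)                                # residual: x + f(x)

    print(torch.allclose(out, x))                      # True: the block is the identity

    out.pow(2).mean().backward()
    print(branch[-1].weight.grad.abs().sum() > 0)      # True: gradients are nonzero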
