Solved – Input layer batch normalization

batch-normalization, neural-networks

If we apply batch normalization to the input layer, is the resulting (trained) network equivalent to the same network without batch normalization but with the inputs standardized using the biased mean and variance estimators?

Of course, the estimates vary during training, so the final weights won't be identical. But if my intuition is correct, would it be reasonable to use a batch norm layer instead of standardization as a pre-processing step?
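To make the equivalence in the question concrete, here is a small NumPy sketch (all names and values are illustrative). It shows that batch normalization on the input layer with γ = 1 and β = 0 computes exactly the same thing as standardization with the biased (divide-by-N) estimators:

```python
import numpy as np

# Hypothetical batch of inputs (rows = samples, columns = features)
rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(8, 3))

eps = 1e-5  # small constant for numerical stability, as in the BN formulation

# Input standardization with *biased* estimators (np.var divides by N by default)
mu = X.mean(axis=0)
var = X.var(axis=0)
X_std = (X - mu) / np.sqrt(var + eps)

# Batch norm applied to the inputs, with identity parameters gamma=1, beta=0
gamma, beta = np.ones(3), np.zeros(3)
X_bn = gamma * (X - mu) / np.sqrt(var + eps) + beta

# With identity gamma/beta, the two transforms coincide
assert np.allclose(X_std, X_bn)
```

The difference in practice, as the answer below notes, is that γ and β are learned during training, so a trained batch norm layer generally will not stay at the identity setting.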

Best Answer

They are unlikely to be the same, since batch normalization adds learnable gamma and beta parameters on top of the normalization step. The paper explains that gamma and beta scale and shift the normalized activations so that the layer can still represent the data appropriately. Here is the relevant passage from the paper:

Note that simply normalizing each input of a layer may change what the layer can represent. For instance, normalizing the inputs of a sigmoid would constrain them to the linear regime of the nonlinearity. To address this, we make sure that the transformation inserted in the network can represent the identity transform. To accomplish this, we introduce, for each activation x^(k), a pair of parameters γ^(k), β^(k), which scale and shift the normalized value

As the paper says, take the sigmoid as an example: with normalization alone, the inputs to the sigmoid would mostly fall in its linear regime (where the gradient is largest), and without the subsequent scaling and shifting, learning would be restricted to that region, which may not be optimal for the model. I think of γ and β as a way of shifting the activation values appropriately, much like biases do, but more effectively (especially since, as the paper mentions, directly manipulating biases alongside normalization is ineffective, because the mean subtraction cancels them).
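The "identity transform" point from the quoted passage can be sketched numerically (illustrative values only): if the learned parameters happen to be γ = √(σ² + ε) and β = μ, batch normalization exactly undoes its own normalization, so the layer loses nothing in representational power:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(loc=3.0, scale=1.5, size=16)  # hypothetical pre-BN activations

eps = 1e-5
mu, var = x.mean(), x.var()
x_hat = (x - mu) / np.sqrt(var + eps)  # normalized activation, as in the paper

# Parameter setting that recovers the identity transform:
gamma = np.sqrt(var + eps)
beta = mu
y = gamma * x_hat + beta

# BN with these gamma/beta reproduces the original activations
assert np.allclose(y, x)
```

In training, γ and β are of course learned by gradient descent rather than set this way; the point is only that this setting is reachable, so normalization does not constrain what the layer can represent.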

Related Question