Solved – Why does batch norm have learnable scale and shift

batch normalization

As far as I understand it, batch norm normalises each of the input features to a layer to have zero mean and unit variance, like a standard normal $\mathcal{N}(\mu=0,\sigma=1)$. The mean and variance $\mu, \sigma^2$ are estimated from the current mini-batch.

After the normalisation the inputs are scaled and shifted by scalar values:

$$\hat{x}_i' = \gamma \hat{x}_i + \beta$$

(Correct me if I'm wrong here – this is where I start to get a bit unsure.)

$\gamma$ and $\beta$ are scalar values, and there is one such pair for every batch-normed layer. They are learnt along with the weights using backprop and SGD.
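For concreteness, this is the training-time computation being described. Below is a minimal sketch (NumPy; the shapes, the `eps` value, and the use of per-feature rather than per-layer parameters are my own assumptions, following standard implementations):

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    """x: (batch, features); gamma, beta: (features,) learned scale and shift."""
    mu = x.mean(axis=0)                      # per-feature mean over the mini-batch
    var = x.var(axis=0)                      # per-feature variance over the mini-batch
    x_hat = (x - mu) / np.sqrt(var + eps)    # normalised activations
    return gamma * x_hat + beta              # learned scale and shift

x = 3.0 * np.random.randn(32, 4) + 7.0       # arbitrary mini-batch
out = batchnorm_forward(x, gamma=np.ones(4), beta=np.zeros(4))
print(out.mean(axis=0), out.std(axis=0))     # roughly 0 and 1 per feature
```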

My question is: aren't these parameters redundant, since the inputs can be scaled and shifted in any way by the weights in the layer itself? In other words, if

$$y = W \hat{x}' + b$$

and

$$\hat{x}' = \gamma \hat{x} + \beta$$

then

$$y = W' \hat{x} + b'$$

where $W' = W\gamma$ and $b'=W\beta + b$.
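A quick numerical check of this folding, as a sketch (NumPy; the shapes and values are made up, and the scalar $\beta$ is treated as adding the same shift to every component, so it enters the folded bias through $W$ applied to a constant vector):

```python
import numpy as np

rng = np.random.default_rng(0)
x_hat = rng.standard_normal((5, 3))          # normalised activations, shape (batch, features)
W = rng.standard_normal((4, 3))              # layer weights, shape (out, in)
b = rng.standard_normal(4)                   # layer bias
gamma, beta = 1.7, -0.3                      # scalar scale and shift

y1 = (gamma * x_hat + beta) @ W.T + b        # scale/shift first, then the layer
W_folded = gamma * W                         # W' = W * gamma
b_folded = W @ (beta * np.ones(3)) + b       # b' = W * (beta * 1) + b
y2 = x_hat @ W_folded.T + b_folded           # same outputs from the folded layer
print(np.allclose(y1, y2))                   # True
```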

So what is the point of adding them if the network is already capable of learning the scale and shift? Or am I totally misunderstanding things?

Best Answer

There is a perfect answer in the Deep Learning Book, Section 8.7.1:

Normalizing the mean and standard deviation of a unit can reduce the expressive power of the neural network containing that unit. To maintain the expressive power of the network, it is common to replace the batch of hidden unit activations $H$ with $\gamma H + \beta$ rather than simply the normalized $H$. The variables $\gamma$ and $\beta$ are learned parameters that allow the new variable to have any mean and standard deviation. At first glance, this may seem useless — why did we set the mean to 0, and then introduce a parameter that allows it to be set back to any arbitrary value $\beta$?

The answer is that the new parametrization can represent the same family of functions of the input as the old parametrization, but the new parametrization has different learning dynamics. In the old parametrization, the mean of $H$ was determined by a complicated interaction between the parameters in the layers below $H$. In the new parametrization, the mean of $\gamma H + \beta$ is determined solely by $\beta$. The new parametrization is much easier to learn with gradient descent.
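A small illustration of that last point, as a sketch (PyTorch; the layer size and the particular $\beta$ values are arbitrary): during training, the per-feature mean of the batch-normed activations comes out as exactly $\beta$, no matter how the upstream activations are scaled or shifted.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
bn = nn.BatchNorm1d(4)                                   # gamma = bn.weight, beta = bn.bias
with torch.no_grad():
    bn.bias.copy_(torch.tensor([1.0, -2.0, 0.5, 3.0]))   # pick an arbitrary beta

for scale in (1.0, 10.0, 100.0):                         # wildly different upstream activations
    h = scale * torch.randn(64, 4) + 5.0 * scale
    out = bn(h)                                          # training mode: uses batch statistics
    print(out.mean(dim=0).detach())                      # ~= beta every time
```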
