Solved – Why does batch normalization use mini-batch statistics instead of the moving averages during training

batch normalization, deep learning, machine learning, neural networks

The traditional approach to batch normalization is to estimate the mean and variance from the current mini-batch and use them to normalize the data at the different layers, while keeping moving averages that are later used at test/prediction time. My question is: wouldn't it be better to use the moving averages at training time too?
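
To make the setup concrete, here is a minimal NumPy sketch of that traditional scheme (the function names and the `momentum` hyperparameter are illustrative, not any particular library's API): training normalizes with the statistics of the current mini-batch, and the moving averages are only accumulated for use at test time.

```python
import numpy as np

def batchnorm_train(x, running_mean, running_var, gamma, beta,
                    momentum=0.9, eps=1e-5):
    """Training-time forward pass of standard batch normalization.

    x: mini-batch of activations, shape (batch_size, num_features).
    Normalization uses the current batch statistics; the moving
    averages are only updated here and used later at test time.
    """
    batch_mean = x.mean(axis=0)
    batch_var = x.var(axis=0)

    # Normalize with the mini-batch statistics (the standard approach).
    x_hat = (x - batch_mean) / np.sqrt(batch_var + eps)
    out = gamma * x_hat + beta

    # Accumulate moving averages for test/prediction time.
    running_mean = momentum * running_mean + (1 - momentum) * batch_mean
    running_var = momentum * running_var + (1 - momentum) * batch_var
    return out, running_mean, running_var

def batchnorm_test(x, running_mean, running_var, gamma, beta, eps=1e-5):
    """Test-time forward pass: normalize with the stored moving averages."""
    x_hat = (x - running_mean) / np.sqrt(running_var + eps)
    return gamma * x_hat + beta
```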

Of course, it would be worse at the very beginning, but if you use, for example, an exponential moving average with a small initial decay (you can increase it later) the moving average will be okay after a few mini-batches. And then, if you get a mini-batch that, just by chance, is further than usual from the average, wouldn't you rather train using the same average that you'll have at test time?

The extreme case would obviously be an online learning setup with one example per batch; with batch statistics, every single example would be normalized to exactly zero at training time (since it is its own batch mean), but not at test time.
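
Concretely, with a batch of size one the batch mean equals the single example and the batch variance is zero, so the normalized activation is always zero regardless of the input (up to the $\epsilon$ in the denominator):

$$\hat{x} = \frac{x - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}} = \frac{x - x}{\sqrt{0 + \epsilon}} = 0,$$

whereas at test time the same example is normalized by the moving averages and generally maps to a nonzero value.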

Best Answer

There is a follow-up paper by Sergey Ioffe (Batch Renormalization) which discusses this issue:

https://arxiv.org/abs/1702.03275

A quote from that paper regarding regular batch normalization:

It is natural to ask whether we could simply use the moving averages $\mu, \sigma$ to perform the normalization during training, since this would remove the dependence of the normalized activations on the other examples in the mini-batch. This, however, has been observed to lead to the model blowing up. As argued in [6, the original batch norm paper], such use of moving averages would cause the gradient optimization and the normalization to counteract each other.

In that paper, this issue is fixed by introducing an additional affine transformation that maps the batch statistics onto the moving-average statistics; its coefficients are treated as constants by the optimization. With this change, the moving averages can effectively be used both at training and at test time.
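
Roughly, the training-time forward pass then looks like the following NumPy sketch (the clipping constants `r_max` and `d_max` and the exact moving-average update are simplifications of the paper's scheme); in a real implementation the correction factors $r$ and $d$ must be excluded from backpropagation, e.g. via a stop-gradient.

```python
import numpy as np

def batch_renorm_train(x, running_mean, running_std, gamma, beta,
                       r_max=3.0, d_max=5.0, momentum=0.99, eps=1e-5):
    """Training-time forward pass of Batch Renormalization (sketch).

    The correction factors r and d map the batch statistics onto the
    moving averages; the optimizer treats them as constants (no gradient
    flows through them), which is what keeps training stable.
    """
    batch_mean = x.mean(axis=0)
    batch_std = np.sqrt(x.var(axis=0) + eps)

    # Affine correction from batch statistics to moving-average statistics,
    # clipped and treated as constants (no backprop through r and d).
    r = np.clip(batch_std / running_std, 1.0 / r_max, r_max)
    d = np.clip((batch_mean - running_mean) / running_std, -d_max, d_max)

    # Normalizing by batch stats and then applying (r, d) is equivalent,
    # up to the clipping, to normalizing by the moving averages.
    x_hat = (x - batch_mean) / batch_std * r + d
    out = gamma * x_hat + beta

    # Moving averages are updated as usual and reused directly at test time.
    running_mean = momentum * running_mean + (1 - momentum) * batch_mean
    running_std = momentum * running_std + (1 - momentum) * batch_std
    return out, running_mean, running_std
```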