Solved – What does MatConvNet do with Batch Normalization during testing and inference

batch normalization, conv-neural-network, deep learning, machine learning, neural networks

I was reading the documentation for evaluating a simple CNN and it said:

In test mode, dropout and batch-normalization are bypassed. Note that,
when a network is deployed, it may be preferable to remove such blocks
altogether.

where it suggests removing the batch normalization layer when the network is deployed (let's ignore dropout because it is not relevant here).

This seems very strange to me, and I've grown a bit skeptical of it.

The reason for this is that the original paper (1) says:

The normalization of activations that depends on the mini-batch allows
efficient training, but is neither necessary nor desirable during
inference; we want the output to depend only on the input,
deterministically. For this, once the network has been trained, we use
the normalization $$\hat{x} = \frac{x – E[x]}{ \sqrt{Var[x] + \epsilon}}$$ using the population,
rather than mini-batch, statistics.

This clearly suggests to me that batch normalization still produces normalized activations $\hat{x}_i$ during inference, except that it uses the population statistics $\mu$, $\sigma$ instead of the mini-batch estimates.
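For concreteness, here is how I understand inference-time batch normalization, as a minimal sketch in plain MATLAB (no MatConvNet calls; all values and variable names below are made up for illustration):

```matlab
% Inference-time batch normalization: normalise with the stored population
% statistics rather than the statistics of the current (test) batch.
x       = randn(4, 1);   % activations of one channel (hypothetical values)
mu_pop  = 0.1;           % population mean, accumulated during training
var_pop = 2.0;           % population variance, accumulated during training
gamma   = 1.5;           % learned scale
beta    = -0.3;          % learned shift
epsilon = 1e-5;

x_hat = (x - mu_pop) ./ sqrt(var_pop + epsilon);  % normalise with population stats
y     = gamma .* x_hat + beta;                    % learned affine transform
```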

Furthermore, at the end of the pseudocode in the paper they suggest replacing the batch normalization transform with:

$$ y^{(k)} = \frac{\gamma^{(k)}}{\sqrt{Var[x^{(k)}] + \epsilon}} \, x^{(k)} + \left( \beta^{(k)} - \frac{\gamma^{(k)} E[x^{(k)}]}{\sqrt{Var[x^{(k)}] + \epsilon}} \right)$$
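For what it's worth, a quick numerical check (again with made-up numbers) confirms that this folded affine map is just the two-step normalisation above written as a single linear transform of $x$:

```matlab
% Check: folded affine form == normalise-then-scale-and-shift.
x = randn(5, 1);
gamma = 1.5; beta = -0.3; mu = 0.1; v = 2.0; epsilon = 1e-5;

y_two_step = gamma .* (x - mu) ./ sqrt(v + epsilon) + beta;   % normalise, then scale and shift
y_folded   = gamma / sqrt(v + epsilon) .* x ...
             + (beta - gamma * mu / sqrt(v + epsilon));       % folded form from the paper
max(abs(y_two_step - y_folded))                               % ~1e-16, the two forms agree
```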

With these two passages from the original paper in mind, I've grown very skeptical that following what MatConvNet suggests is reasonable. Furthermore, if the network is trained with the BN layers, then all of its parameters were learned assuming those layers are part of the network, so removing the layers altogether during inference or testing intuitively seems like a bad idea. Is that what MatConvNet suggests? Or do I have a misunderstanding? How exactly should one use Batch Normalization with MatConvNet?


1: Ioffe, S. and Szegedy, C. (2015), "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift", Proceedings of the 32nd International Conference on Machine Learning, Lille, France. Journal of Machine Learning Research: W&CP, volume 37.

Best Answer

What is meant by batch normalisation being bypassed is that, in test mode, the layer does not normalise the activations using the statistics of the current batch.

In the newer versions (since beta-18 I believe), the population statistics are computed during training as another parameter and then used during test time (e.g. see documentation here and here).
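Conceptually, the idea is that the statistics are accumulated as a moving average over the training batches and then frozen for test time. A hypothetical sketch (not the actual MatConvNet code; momentum value and variable names are made up):

```matlab
% Hypothetical sketch of accumulating population statistics during training;
% the actual MatConvNet implementation differs in detail.
momentum = 0.9;  mu_pop = 0;  var_pop = 1;
for t = 1:100
    batch = 0.5 + 2 * randn(64, 1);                 % stand-in for one mini-batch of activations
    mu_b  = mean(batch);                            % mini-batch mean (used for training)
    var_b = var(batch, 1);                          % mini-batch variance (used for training)
    mu_pop  = momentum * mu_pop  + (1 - momentum) * mu_b;    % running population mean
    var_pop = momentum * var_pop + (1 - momentum) * var_b;   % running population variance
end
% In test mode the layer normalises with mu_pop and var_pop instead of mu_b and var_b.
```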

What is meant by removing the batch-normalisation is to apply its multiplicative and additive constants to the closest convolution layer. You can see how this is done in the cnn_imagenet_deploy script in the ImageNet examples.
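The idea is roughly the following (a hypothetical sketch in plain MATLAB, not the actual deployment code; sizes and values are made up):

```matlab
% Fold the batch-normalisation constants into the preceding convolution layer.
W      = randn(3, 3, 16, 32);        % conv filters: h x w x c_in x c_out
b      = randn(1, 32);               % conv biases, one per output channel
gamma  = randn(1, 32);               % learned BN scale
beta   = randn(1, 32);               % learned BN shift
mu     = randn(1, 32);               % stored population mean
sigma2 = abs(randn(1, 32));          % stored population variance
epsilon = 1e-5;

s = gamma ./ sqrt(sigma2 + epsilon); % per-channel multiplicative constant
for k = 1:size(W, 4)
    W(:, :, :, k) = s(k) * W(:, :, :, k);   % scale the k-th filter
end
b = (b - mu) .* s + beta;                   % fold the additive constant into the bias
% The updated convolution now computes conv(x, W) + b, which equals the original
% convolution followed by batch normalisation in test mode, so the BN block can be removed.
```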

Sorry for the misunderstanding. We will update the documentation to make it more clear.