Solved – an explanation of the example of why batch normalization has to be done with some care

conv-neural-network, machine-learning, neural-networks

I was reading the batch normalization paper[1] and it has one section where it goes through an example, trying to show why normalization has to be done carefully. I honestly can't understand how the example works, and I am genuinely curious to understand the paper as much as I can. First let me quote it here:

For example, consider a layer with the input $u$ that adds the learned bias $b$, and normalizes the result by subtracting the mean of the activation computed over the training data: $\hat{x} = x - E[x]$ where $x = u + b$, $X = \{x_{1 \ldots N}\}$ is the set of values of $x$ over the training set, and $E[x] = \frac{1}{N}\sum^N_{i=1} x_i$. If a gradient descent step ignores the dependence of $E[x]$ on $b$, then it will update $b \leftarrow b + \Delta b$, where $\Delta b \propto -\frac{\partial l}{\partial \hat{x}}$. Then $u + (b + \Delta b) - E[u + (b + \Delta b)] = u + b - E[u + b]$. Thus, the combination of the update to $b$ and the subsequent change in normalization led to no change in the output of the layer nor, consequently, the loss.
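To make sure I at least understand the claim itself, here is a quick numerical sketch I put together (my own code, not from the paper): it treats $u$ as a fixed batch of inputs, adds the bias, subtracts the batch mean, and checks that adding an arbitrary $\Delta b$ to $b$ leaves $\hat{x}$ unchanged. The arithmetic checks out; it's the notation and the gradient argument that I can't follow.

```python
import numpy as np

# A fixed batch of layer inputs u (any values work for the illustration).
u = np.array([0.5, -1.2, 3.0, 0.7])

def normalized_output(u, b):
    """x = u + b, then subtract the mean of x over the batch: x_hat = x - E[x]."""
    x = u + b
    return x - x.mean()

b = 0.3
delta_b = 10.0  # an arbitrary update to the bias

x_hat_before = normalized_output(u, b)
x_hat_after = normalized_output(u, b + delta_b)

# The bias update is completely absorbed by the mean subtraction.
print(np.allclose(x_hat_before, x_hat_after))  # True
```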

I think I understand the message: if one does not do normalization properly, it can be bad. I just don't see how the example they are using portrays this.

I am aware that it's difficult to help someone if they are not specific about what is confusing them, so in the next section I will lay out the things that are confusing me about their explanation.


I think most of my confusions might be notational, so I will clarify.

First, I think one of the things that is confusing me a lot is what it means for the authors to have a unit in the network and what an activation is. Usually, I think of an activation as:

$$ x^{(l)} = a^{(l)} = \theta(z^{(l)}) = \theta( \langle w^{(l)}, x^{(l-1)} \rangle + b^{(l)})$$

where $x^{(0)} = a^{(0)} = x $ is the raw feature vector from the first input layer.

Also, one of the first things that confuses me (due to the previous reason) is what the scenario they are trying to explain really is. It says:

normalizes the result by subtracting the mean of the
activation computed over the training data: $\hat{x} = x − E[x]$ where $x=u+b$

I think what they are trying to say is that instead of using the activations $x^{(l)} = a^{(l)}$ as computed by the forward pass, one performs some kind of "normalization" by subtracting the mean activation:

$$\hat{x}^{(l)} = x^{(l)} - \bar{x}^{(l)}, \qquad \bar{x}^{(l)} = \bar{a}^{(l)} = \frac{1}{N} \sum^{N}_{i=1} a^{(l)}_i = \frac{1}{N} \sum^{N}_{i=1} x^{(l)}_i $$

and then passes that to the back-propagation algorithm. Or at least that's what would make sense to me.

Related to this, I guess what they call $u$ is maybe $x^{(l)}$? That's what I would guess, because they call it the "input" and have the equation $x = u + b$ (I guess they are using the identity/linear activation unit for their neural network? Maybe).

To further confuse me, they define $\Delta b$ as something proportional to the partial derivative, but the partial derivative is computed with respect to $\hat{x}$, which seems really bizarre to me. Usually, the partial derivatives in gradient descent are taken with respect to the parameters of the network. In the case of an offset, I would have thought:

$$ \Delta b^{(l)} \propto -\frac{\partial l}{\partial b^{(l)} } $$

makes more sense, rather than taking the derivative with respect to the normalized activations. I was trying to understand why they'd take the derivative with respect to $\hat{x}$, and I thought maybe they were referring to the deltas when they wrote $\frac{ \partial l }{ \partial \hat{x} }$, since usually that is the only part of the back-prop algorithm that has a derivative with respect to pre-activations; the equation for delta is:

$$ \delta^{(l)}_j = \frac{\partial L}{\partial z^{(l)}_j}$$

Another thing that confuses me is:

Then $u + (b + \Delta b) - E[u + (b + \Delta b)] = u + b - E[u + b]$.

They don't really say what they are trying to compute in the above equation, but I would infer that they are computing the updated normalized activation (for the first layer?) after $b$ is updated to $b + \Delta b$? I'm not sure I buy their point, because I think the correct equation should have been:

$$\hat{x} = \theta( u + (b + \Delta b) ) - E[\theta( u + (b + \Delta b) )] $$

which doesn't cancel out $\Delta b$, the change in the parameter $b$. However, I don't really know what they are doing, so I am just guessing. What exactly is the equation they have written?

I am not sure if this is the right understanding, but I've given some thought to their example. It seems that their example has no non-linear activation unit (it uses the identity) and they are talking about the first input layer only? Since they left out a lot of the details and the notation isn't very clear, I can't deduce exactly what they are talking about. Does someone know how to express this example with notation that shows what's going on at each layer? Does someone understand what is actually going on in that example and want to share their wisdom with me?


[1]: Ioffe, S. and Szegedy, C. (2015), "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift", Proceedings of the 32nd International Conference on Machine Learning, Lille, France. JMLR: W&CP volume 37.

Best Answer

I think the whole point of this paragraph is that if a gradient descent step ignores the dependence of $E[x]$ on $b$, updating the bias term $b$ will lead to no change in the output, as claimed in the sentence before it:

However, if these modifications are interspersed with the optimization steps, then the gradient descent step may attempt to update the parameters in a way that requires the normalization to be updated, which reduces the effect of the gradient step.

Therefore they made the gradient descent step aware of the normalization in their method.


Regarding your questions:

Relating to this, I guess what they call $u$ is maybe $x^{(l)}$?

As claimed in their first sentence, $u$ is the input of the layer. What $u$ actually is doesn't seem to matter, as they're illustrating only the effect of $b$ in the example.

I would have thought $ \Delta b \propto -\frac{\partial l}{\partial b } $ makes more sense, rather than taking the derivative with respect to the normalized activations.

We know $\hat{x}=x-E[x]=u+b-E[x]$, as we are ignoring the dependence of $E[x]$ on $b$, we have $$\frac{\partial l}{\partial b}=\frac{\partial l}{\partial \hat{x}}\frac{\partial \hat{x}}{\partial b} = \frac{\partial l}{\partial \hat{x}},$$ so $\Delta b \propto -\frac{\partial l}{\partial \hat{x}}$.
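If you want to see this numerically, here is a small sketch using PyTorch autograd (my own illustration, not from the paper): detaching the batch mean reproduces "ignoring the dependence of $E[x]$ on $b$", and the gradient with respect to $b$ then equals the gradient with respect to $\hat{x}$ summed over the batch, because $\partial \hat{x} / \partial b = 1$ elementwise.

```python
import torch

u = torch.tensor([0.5, -1.2, 3.0, 0.7])
b = torch.tensor(0.3, requires_grad=True)

x = u + b
x_hat = x - x.mean().detach()  # ignore the dependence of E[x] on b, as in the example
x_hat.retain_grad()            # keep dl/dx_hat so we can compare

loss = (x_hat ** 2).sum()      # any differentiable loss works for the illustration
loss.backward()

# Since the detached mean contributes nothing, dl/db is just dl/dx_hat summed over the batch.
print(b.grad.item(), x_hat.grad.sum().item())  # identical values
```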

$u + (b + \Delta b) - E[u + (b + \Delta b)] = u + b - E[u + b]$ they don't really say what they are trying to compute in the above equation but I would infer that they are trying to compute the updated normalized activation (for the first layer?) after $b$ is updated to $b+\Delta b$?

It is computing $\hat{x}$ after $b$ is updated to $b + \Delta b$, to show that if a gradient descent step ignores the dependence of $E[x]$ on $b$, updating the bias term $b$ leads to no change in the output.
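Spelling out the algebra, the step the paper leaves implicit is that $b$ and $\Delta b$ are the same constants for every training example, so they pass straight through the expectation:

$$u + (b + \Delta b) - E[u + (b + \Delta b)] = u + (b + \Delta b) - \big(E[u] + b + \Delta b\big) = u - E[u] = u + b - E[u + b],$$

which is exactly the paper's claim that the update to $b$ has no effect on the layer's output.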


It might be helpful to take a look at some open source implementations of batch normalization, for example in Lasagne and Keras.
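If you don't want to dig through a full framework, the forward pass is short enough to sketch directly. Here is a minimal NumPy version of the batch-norm transform from the paper (batch statistics followed by the learned scale $\gamma$ and shift $\beta$); the function and variable names are my own:

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """Batch normalization of a mini-batch x of shape (batch_size, num_features)."""
    mu = x.mean(axis=0)                    # per-feature batch mean
    var = x.var(axis=0)                    # per-feature batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)  # normalize
    return gamma * x_hat + beta            # learned scale and shift

x = np.random.randn(8, 3)
out = batch_norm_forward(x, gamma=np.ones(3), beta=np.zeros(3))
print(out.mean(axis=0), out.std(axis=0))   # approximately 0 and 1 per feature
```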

There's another question that might seem related: Why take the gradient of the moments (mean and variance) when using Batch Normalization in a Neural Network?