Solved – Understanding batch normalization

Tags: batch normalization, deep learning, neural networks

In the paper Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift (here), before explaining the process of batch normalization, the authors try to explain the issues with naively combining normalization and gradient descent (I am not getting what the exact issue addressed here is).

Excerpt from section 2, para 2:

We could consider whitening activations at every training step or at some interval, either by modifying the network directly or by changing the parameters of the optimization algorithm to depend on the network activation values (Wiesler et al., 2014; Raiko et al., 2012; Povey et al., 2014; Desjardins & Kavukcuoglu). However, if these modifications are interspersed with the optimization steps, then the gradient descent step may attempt to update the parameters in a way that requires the normalization to be updated, which reduces the effect of the gradient step. For example, consider a layer with the input $u$ that adds the learned bias $b$, and normalizes the result by subtracting the mean of the activation computed over the training data: $\hat x = x - E[x]$, where $x = u + b$, $X = \{x_{1 \ldots N}\}$ is the set of values of $x$ over the training set, and $E[x] = \frac{1}{N}\sum_{i=1}^{N} x_i$.

If a gradient descent step ignores the dependence of $E[x]$ on $b$, then it will update $b \leftarrow b + \Delta b$, where $\Delta b \propto -\partial\ell/\partial\hat x$. Then

$$u + (b + \Delta b) - E[u + (b + \Delta b)] = u + b - E[u + b] \tag 1$$

Thus, the combination of the update to b and subsequent
change in normalization led to no change in the output
of the layer nor, consequently, the loss. As the training
continues, b will grow indefinitely while the loss remains
fixed. This problem can get worse if the normalization not
only centers but also scales the activations.

Here is my understanding of the passage:

  1. We have a batch of size $N$ (one training batch).

  2. Let there be two arbitrary hidden layers, L1 and L2, connected by parameters $W$ and $b$.

  3. The output coming out of L1 is $x_1$.

  4. $u = x_1 W$ (this is where the excerpt above starts; the dimension of $u$ is $M \times N$, where $M$ is the number of units in L2).

  5. $x = u + b$ (dimension of $b$ = dimension of $x$ = dimension of $u$ = $M \times N$).

  6. Now, before feeding $x$ into L2, we centre it by subtracting the mean of $x$ from each entry in $x$: $\hat x = x - E[x]$.

  7. We compute the loss, backpropagate the gradient, and update just this layer's $b$ as a sanity test: new $b = b + \Delta b$.

  8. We run it again on the same batch with the updated $b$.

  9. Repeat steps 3 and 4.

  10. $x_{new} = u + b + \Delta b$ (dimension of $b$ and $\Delta b$ = dimension of $x$ = dimension of $u$ = $M \times N$).

  11. Now, before feeding $x$ into L2, we centre it again: $\hat x = x + \Delta b - E[x + \Delta b] = x - E[x]$, which is the same as what was calculated before updating $b$, and hence updating $b$ had no effect on the training (the NumPy check below confirms this numerically).
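
A quick numerical check of steps 5–11 (a minimal NumPy sketch; the sizes and random values are arbitrary placeholders, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

M, N = 4, 8                          # M units in L2, batch of size N
u = rng.normal(size=(M, N))          # pre-bias activations, u = x1 @ W
b = rng.normal(size=(M, 1))          # learned bias, broadcast over the batch
delta_b = rng.normal(size=(M, 1))    # a hypothetical gradient update to b

def center(x):
    # subtract the per-unit batch mean, i.e. E[x] in the paper's notation
    return x - x.mean(axis=1, keepdims=True)

x_hat_before = center(u + b)                # step 6: normalize with old b
x_hat_after = center(u + (b + delta_b))     # step 11: normalize with updated b

print(np.allclose(x_hat_before, x_hat_after))  # True: the update cancelled out
```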

My question is about this part of the excerpt:

"If a gradient
descent step ignores the dependence of E[x] on b, then it
will update $b ← b + ∆b$, where $∆b ∝ −\partial l/\partial\hat x$. Then
$$u + (b + ∆b) − E[u + (b + ∆b)] = u + b − E[u + b] \tag 1$$."

Why is

$$u + (b + \Delta b) - E[u + (b + \Delta b)] = u + b - E[u + b] \tag 1$$

dependent upon what comes before it? What is even the point of that bit? Please also note the usage of the word "Then" (made bold), implying the statement necessarily draws causality from what comes before.

Best Answer

Their notation is confusing. For instance, they use the hat symbol to denote the normalized variable $\hat x = x - E[x]$; then, even worse, they use the expectation symbol to denote what is clearly a sample average estimator: $E[x] = \frac{1}{N}\sum_{i=1}^{N} x_i$.

I assume that when you wrote $E[b]=b$, you meant the expectation of the bias. That's not what they would denote by $E[b]$: they'd mean a simple sample average estimator, the quantity usually denoted $\bar b = \frac{1}{N}\sum_j b_j$, where $b_j$ is the bias learned from batch $j$.

All they're saying in this paragraph is that you have to be mindful of how you implement normalization, because if you do it wrong it will interfere with gradient descent. Their example is first learning the bias, then normalizing. So, you learn the bias update $\Delta b$ from a batch, but then when you subsequently normalize, you negate the learning by subtracting what you learned.
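
Writing out the algebra makes the cancellation explicit. Because $\Delta b$ (like $b$) is the same for every example in the batch, the sample average satisfies $E[u + (b + \Delta b)] = E[u] + b + \Delta b$, so

$$u + (b + \Delta b) - E[u + (b + \Delta b)] = u - E[u] = u + b - E[u + b],$$

where the last equality is the same cancellation applied with $\Delta b = 0$. The bias update appears on both sides of the subtraction and vanishes, which is exactly what equation (1) asserts.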

It goes like this. First you learn $\Delta b$, which means you are prepared to output $u + b + \Delta b$. However, you squeezed another operation in just before outputting from the layer: you decided to normalize. So you subtract the sample average, i.e. what they unfortunately denote by $E[u + b + \Delta b]$. This leads to $\Delta b$ and $E[\Delta b]$ cancelling each other out between the gradient step and the normalization (since $\Delta b$ is constant across the batch, $E[\Delta b] = \Delta b$), i.e. you didn't learn anything.
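
To see the consequence the paper warns about, here is a minimal sketch of the runaway bias, assuming a toy one-unit layer with a squared-error loss (the sizes, data, and loss are invented for illustration, not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(1)
u = rng.normal(size=10)               # fixed pre-bias inputs for one unit
target = rng.normal(size=10) + 1.0    # arbitrary targets, mean shifted off zero
b, lr = 0.0, 0.1

for step in range(5):
    x_hat = (u + b) - (u + b).mean()           # normalize: subtract the batch mean
    loss = 0.5 * ((x_hat - target) ** 2).mean()
    # naive gradient w.r.t. b that IGNORES the dependence of E[x] on b:
    # it treats d(x_hat)/db as 1, so grad_b = mean(x_hat - target)
    grad_b = (x_hat - target).mean()
    b -= lr * grad_b
    print(f"step {step}: loss = {loss:.6f}, b = {b:.3f}")
```

Running this prints the same loss at every step (since $\hat x$ does not actually depend on $b$) while $b$ drifts by a constant amount per step, i.e. $b$ grows indefinitely while the loss remains fixed, just as the paper says.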