Layer Normalization – Understanding cs231n Analogy of Layer Normalization

data-preprocessing, normalization

In Assignment 2 of CS231n, one of the questions asks: "Which of these data pre-processing steps is analogous to batch normalization, and which is analogous to layer normalization?"

One of the options is "Scaling each image in the dataset, so that the RGB channels for all pixels within an image sums up to 1."

Intuitively, it makes sense that this option is analogous to layer normalization. However, I would like to have concrete values for $\gamma$ (scale parameter) and $\beta$ (shift parameter) that satisfy the scenario above.

Notation-wise, let $x$ denote the input matrix. Then, the normalized input matrix is $\hat{x}=\frac{x-\mu}{\sigma}$, where $\mu$ and $\sigma$ denote the sample mean and sample standard deviation vectors respectively (with implicit Python broadcasting). Lastly, the output matrix of layer normalization is $y=\gamma\hat{x}+\beta$, where $\gamma$ is the scale parameter vector and $\beta$ is the shift parameter vector.
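For concreteness, here is a minimal NumPy sketch of that forward pass (the function name `layernorm_forward` and the `eps` term are my own additions, not part of the notation above):

```python
import numpy as np

def layernorm_forward(x, gamma, beta, eps=1e-5):
    """Layer normalization for an input x of shape (N, D).

    Statistics are computed per example (per row); gamma and beta
    have shape (D,) and are broadcast across the rows.
    """
    mu = x.mean(axis=1, keepdims=True)     # per-example mean, shape (N, 1)
    var = x.var(axis=1, keepdims=True)     # per-example variance, shape (N, 1)
    x_hat = (x - mu) / np.sqrt(var + eps)  # normalized input
    y = gamma * x_hat + beta               # scale and shift
    return y
```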

Best Answer

I don't think the question is asking you to find concrete parameter values that make the pre-processing step equivalent to batch/layer normalization; it is about which dimension the normalization is performed over.

However, to answer your question: the scale and shift parameters are trained parameters that essentially allow the network to set its own $μ$ and $σ$ for that layer (instead of being stuck with $μ=0$ and $σ=1$). As stated by the authors, if you were to set $γ = \sqrt{Var[x]}$ and $β = Ε[x]$, you would recover the original, un-normalized values of $x$.
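A quick numerical sanity check of that statement (a toy sketch using batch-norm-style per-feature statistics, with the epsilon term omitted):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=(1000, 4))  # toy activations, shape (N, D)

# Normalize each feature to zero mean / unit std.
mu = x.mean(axis=0)
sigma = x.std(axis=0)
x_hat = (x - mu) / sigma

# Choosing gamma = sqrt(Var[x]) = sigma and beta = E[x] = mu undoes the normalization.
gamma, beta = sigma, mu
y = gamma * x_hat + beta

print(np.allclose(y, x))  # True: the original activations are recovered
```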


First, a quick recap of what each normalization does:

Assume you have a batch of $N=32$ RGB images ($C=3$), with a resolution of $H \times W = 256 \times 256$. This would mean that your batch would have a shape of $(N, H, W, C) = (32, 256, 256, 3)$.

Batch normalization treats each channel (i.e. the $C$ dimension) separately. The equivalent would be to scale each of the $3$ channels independently so that it has zero mean and unit standard deviation. So for each channel, batch norm computes $μ$ and $σ$ along the $(N, H, W)$ axes. The result is that for each of the $c \in C$ channels $x_c$:

$$ \frac{1}{N \cdot H \cdot W}\sum_{n=1}^N\sum_{h=1}^H\sum_{w=1}^W{x_{n, h, w, c}} = 0 $$
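A small NumPy sketch of this (using a smaller spatial size than the $256 \times 256$ example above, just to keep it light):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(size=(32, 8, 8, 3))         # (N, H, W, C), small H, W for brevity

# Batch norm: one mu/sigma per channel, computed over the (N, H, W) axes.
mu = x.mean(axis=(0, 1, 2), keepdims=True)  # shape (1, 1, 1, C)
sigma = x.std(axis=(0, 1, 2), keepdims=True)
x_hat = (x - mu) / sigma

print(np.allclose(x_hat.mean(axis=(0, 1, 2)), 0))  # True: zero mean per channel
print(np.allclose(x_hat.std(axis=(0, 1, 2)), 1))   # True: unit std per channel
```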

Layer normalization, on the other hand, treats each image in the batch (i.e. the $N$ dimension) separately. The equivalent of this would be scaling each of the $32$ images independently to have zero mean and unit standard deviation. This means that layer norm computes $μ$ and $σ$ along the $(H, W, C)$ axes. The result is that for each of the $n \in N$ images $x_n$:

$$ \frac{1}{H \cdot W \cdot C}\sum_{h=1}^H\sum_{w=1}^W\sum_{c=1}^C{x_{n, h, w, c}} = 0 $$
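And the layer-norm counterpart, again as a sketch with the same NHWC layout:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(size=(32, 8, 8, 3))         # (N, H, W, C), small H, W for brevity

# Layer norm: one mu/sigma per image, computed over the (H, W, C) axes.
mu = x.mean(axis=(1, 2, 3), keepdims=True)  # shape (N, 1, 1, 1)
sigma = x.std(axis=(1, 2, 3), keepdims=True)
x_hat = (x - mu) / sigma

print(np.allclose(x_hat.mean(axis=(1, 2, 3)), 0))  # True: zero mean per image
print(np.allclose(x_hat.std(axis=(1, 2, 3)), 1))   # True: unit std per image
```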

(Figure: a diagram of which axes each normalization computes its statistics over.)

Now, on to the question:

Scaling each image in the dataset, so that the RGB channels for all pixels within an image sums up to 1

First of all, by saying "scaling each image" and afterwards "an image", it clearly states that the images are treated independently when normalizing (so batch normalization is out of the question).

The second part of the sentence is what I'm having trouble understanding: does it simply mean that the sum of all pixels in an image is equal to $1$?

$$ \sum_{h=1}^H\sum_{w=1}^W\sum_{c=1}^C{x_{n, h, w, c}} = 1 $$

This would be analogous to layer normalization.
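Under that reading, the scaling is just a division of each image by its total sum (a small NumPy sketch, with the array shapes following the NHWC example above):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(size=(32, 8, 8, 3))            # (N, H, W, C)

# Scale each image so that the sum over all its pixels and channels is 1.
totals = x.sum(axis=(1, 2, 3), keepdims=True)  # shape (N, 1, 1, 1)
x_scaled = x / totals

print(np.allclose(x_scaled.sum(axis=(1, 2, 3)), 1))  # True for every image
```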

However, the most likely reading is that the R channel pixels sum up to $1$, the G channel pixels sum up to $1$, and the B channel pixels sum up to $1$:

$$ \sum_{h=1}^H\sum_{w=1}^W{x_{n, h, w, c}} = 1 $$

This actually is another case of normalization, called instance normalization, which treats both images (i.e. $N$) and channels (i.e. $C$) independently. Here, $μ$ and $σ$ are computed along the $(H, W)$ axes.
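A sketch of that instance-norm style computation in the same NHWC layout:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(size=(32, 8, 8, 3))         # (N, H, W, C)

# Instance norm: one mu/sigma per (image, channel) pair,
# computed over the spatial axes (H, W) only.
mu = x.mean(axis=(1, 2), keepdims=True)     # shape (N, 1, 1, C)
sigma = x.std(axis=(1, 2), keepdims=True)
x_hat = (x - mu) / sigma

print(np.allclose(x_hat.mean(axis=(1, 2)), 0))  # True: zero mean per image and channel
print(np.allclose(x_hat.std(axis=(1, 2)), 1))   # True: unit std per image and channel
```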
