Neural Networks – Dimensions of Scale (Gamma) and Offset (Beta) in Batch Norm

batch-normalization, neural-networks

When implementing Batch Normalization for a particular layer 'L' with 'n' hidden units in a neural network, we first normalize the activation values of that layer using their respective mean and standard deviation, and then apply the scaling and offset factors as shown:

$$X_{\text{norm}} = \frac{X - \mu}{\sigma}$$
$$X' = \gamma \, X_{\text{norm}} + \beta$$

where

$\mu$ = mean of $X$, an $(n, 1)$ vector
$\sigma$ = standard deviation of $X$, also an $(n, 1)$ vector
$X$ = activation values of layer 'L', with dimension $(n, m)$ for a mini-batch of size $m$
$X_{\text{norm}}$ = normalized $X$, with dimension $(n, m)$
$\gamma$ = Gamma, the scaling factor
$\beta$ = Beta, the offset factor

Now my question is: what are the dimensions of Gamma and Beta? Are they $(n, 1)$ vectors or are they $(n, m)$ matrices? My intuition says that since they are somewhat analogous to the mean and standard deviation, they should be $(n, 1)$ vectors.

Best Answer

The symbols $\gamma, \beta$ are $n$-vectors because there is a pair of scalar parameters $\gamma^{(k)}, \beta^{(k)}$ for each activation $x^{(k)}$.
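
To make the broadcasting concrete, here is a minimal NumPy sketch (my own illustration, not from the post or the paper) of the per-layer transform, with $\gamma$ and $\beta$ stored as $(n, 1)$ vectors that broadcast across the $(n, m)$ mini-batch of activations:

```python
import numpy as np

n, m = 4, 8                        # n hidden units in layer L, mini-batch size m
X = np.random.randn(n, m)          # activations of layer L, shape (n, m)

mu = X.mean(axis=1, keepdims=True)     # per-unit mean, shape (n, 1)
sd = X.std(axis=1, keepdims=True)      # per-unit std,  shape (n, 1)
X_norm = (X - mu) / (sd + 1e-5)        # shape (n, m); small epsilon for stability

gamma = np.ones((n, 1))            # one scale parameter per hidden unit
beta = np.zeros((n, 1))            # one shift parameter per hidden unit
X_out = gamma * X_norm + beta      # (n, 1) broadcasts against (n, m) -> (n, m)

print(gamma.shape, beta.shape, X_out.shape)   # (4, 1) (4, 1) (4, 8)
```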

From the batch norm paper:

Note that simply normalizing each input of a layer may change what the layer can represent. For instance, normalizing the inputs of a sigmoid would constrain them to the linear regime of the nonlinearity. To address this, we make sure that the transformation inserted in the network can represent the identity transform. To accomplish this, we introduce, **for each activation $x^{(k)}$, a pair of parameters $\gamma^{(k)}, \beta^{(k)}$**, which scale and shift the normalized value: $$ y^{(k)} = \gamma^{(k)}\hat{x}^{(k)} + \beta^{(k)}. $$ These parameters are learned along with the original model parameters, and restore the representation power of the network. Indeed, by setting $\gamma^{(k)} = \sqrt{\text{Var}\left[x^{(k)}\right]}$ and $\beta^{(k)} = \mathbb{E}\left[x^{(k)}\right]$, we could recover the original activations, if that were the optimal thing to do.

Emphasis mine.
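
The identity-recovery point quoted above is easy to check numerically. Below is a small sketch (my own, assuming the simple per-unit batch norm described in the question) that sets $\gamma^{(k)} = \sqrt{\text{Var}[x^{(k)}]}$ and $\beta^{(k)} = \mathbb{E}[x^{(k)}]$ and verifies that the original activations come back, up to the stabilizing epsilon:

```python
import numpy as np

n, m = 4, 8
X = np.random.randn(n, m) * 3.0 + 1.5        # activations with non-trivial mean/variance

mu = X.mean(axis=1, keepdims=True)           # E[x^(k)] per unit, shape (n, 1)
var = X.var(axis=1, keepdims=True)           # Var[x^(k)] per unit, shape (n, 1)
X_norm = (X - mu) / np.sqrt(var + 1e-5)

# Choosing gamma = sqrt(Var[x]) and beta = E[x] for every unit ...
gamma = np.sqrt(var)
beta = mu
X_recovered = gamma * X_norm + beta

# ... undoes the normalization and recovers the original activations.
print(np.allclose(X_recovered, X, atol=1e-3))   # True
```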

"Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift." Sergey Ioffe, Christian Szegedy
