Solved – Matrix form of backpropagation with batch normalization

backpropagation, batch normalization, deep learning, machine learning, neural networks

Batch normalization has been credited with substantial performance improvements in deep neural nets. Plenty of material on the internet shows how to implement it on an activation-by-activation basis. I've already implemented backprop using matrix algebra, and given that I'm working in high-level languages (while relying on Rcpp, and eventually GPUs, for dense matrix multiplication), ripping everything out and resorting to for-loops would probably slow my code substantially, in addition to being a huge pain.

The batch normalization function is
$$
b(x_p) = \gamma \left(x_p - \mu_{x_p}\right) \sigma^{-1}_{x_p} + \beta
$$
where

  • $x_p$ is the $p$th node, before it gets activated
  • $\gamma$ and $\beta$ are scalar parameters
  • $\mu_{x_p}$ and $\sigma_{x_p}$ are the mean and SD of $x_p$. (Note that the square root of the variance plus a fudge factor is what's normally used; let's assume nonzero elements for compactness.)

In matrix form, batch normalization for a whole layer would be
$$
b(\mathbf{X}) = \left(\gamma\otimes\mathbf{1}_N\right)\odot \left(\mathbf{X} - \mu_{\mathbf{X}}\right) \odot\sigma^{-1}_{\mathbf{X}} + \left(\beta\otimes\mathbf{1}_N\right)
$$
where

  • $\mathbf{X}$ is $N\times p$
  • $\mathbf{1}_N$ is a column vector of ones
  • $\gamma$ and $\beta$ are now row $p$-vectors of the per-layer normalization parameters
  • $\mu_{\mathbf{X}}$ and $\sigma_{\mathbf{X}}$ are $N \times p$ matrices in which each column is the $N$-vector that repeats that column's mean and standard deviation, respectively
  • $\otimes$ is the Kronecker product and $\odot$ is the elementwise (Hadamard) product
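
To make the notation concrete, here is a tiny R sketch (toy sizes and throwaway names, separate from the setup used further down) showing that the Kronecker-product form above is just columnwise standardization followed by a per-column scale and shift:

# minimal sketch of the matrix batch norm above (toy sizes; names are illustrative)
N0 <- 5; p0 <- 3
X0 <- matrix(rnorm(N0 * p0), N0)
g0 <- c(2, 3, 4)                      # row p-vector of scale parameters (gamma)
b0 <- c(-1, 0, 1)                     # row p-vector of shift parameters (beta)
ones_N <- matrix(1, N0, 1)

# (gamma %x% 1_N) stacks gamma into N identical rows; scale() handles mu and sigma^-1
bn_kron <- (t(g0) %x% ones_N) * scale(X0) + (t(b0) %x% ones_N)
# the same thing, done column by column
bn_cols <- sweep(sweep(scale(X0), 2, g0, `*`), 2, b0, `+`)
all.equal(c(bn_kron), c(bn_cols))     # TRUE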

A very simple one-layer neural net with no batch normalization and a continuous outcome is
$$
y = a\left(\mathbf{X\Gamma}_1\right)\Gamma_2 + \epsilon
$$

where

  • $\Gamma_1$ is $p_1 \times p_2$
  • $\Gamma_2$ is $p_2 \times 1$
  • $a(.)$ is the activation function

If the loss is $R = N^{-1}\displaystyle\sum\left(y - \hat{y}\right)^2$, then the gradients are
$$
\begin{array}{lr}
\frac{\partial R}{\partial \Gamma_1} = \mathbf{X}^T \left(a'(\mathbf{X}\mathbf{\Gamma}_1) \odot -2\hat\epsilon \mathbf{\Gamma}_2^T\right)\\
\frac{\partial R}{\partial \Gamma_2} = -2\mathbf{V}^T \hat\epsilon \\
\end{array}
$$

where

  • $\mathbf{V} = a\left(\mathbf{X}\Gamma_1\right)$
  • $\hat{\epsilon} = y-\hat{y}$
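
Before adding batch normalization, these two expressions are easy to sanity-check numerically. Here is a sketch with a throwaway ReLU net (all names below are redefined by the real setup further down; the loss drops the $N^{-1}$ factor, matching the gradient expressions above):

# finite-difference check of the two gradients above (toy setup; illustrative only)
set.seed(2)
N <- 10; p1 <- 7; p2 <- 4
X  <- matrix(rnorm(N * p1), N)
G1 <- matrix(rnorm(p1 * p2), p1)
G2 <- rnorm(p2)
y  <- rnorm(N)

a  <- function(v) pmax(v, 0)                 # ReLU
ap <- function(v) (v > 0) * 1                # its derivative
R  <- function(G1, G2) sum((y - a(X %*% G1) %*% G2)^2)

ehat <- y - a(X %*% G1) %*% G2
dG1  <- t(X) %*% (ap(X %*% G1) * (-2 * ehat %*% t(G2)))   # X'(a'(X G1) o -2 ehat G2')
dG2  <- -2 * t(a(X %*% G1)) %*% ehat                      # -2 V' ehat

h <- 1e-6
num_dG2 <- sapply(1:p2, function(j) {
  e <- rep(0, p2); e[j] <- h
  (R(G1, G2 + e) - R(G1, G2 - e)) / (2 * h)
})
max(abs(num_dG2 - dG2))                      # effectively zero

G1p <- G1; G1m <- G1; G1p[1, 1] <- G1p[1, 1] + h; G1m[1, 1] <- G1m[1, 1] - h
c((R(G1p, G2) - R(G1m, G2)) / (2 * h), dG1[1, 1])   # should agree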

Under batch normalization, the net becomes
$$
y = a\left(b\left(\mathbf{X}\Gamma_1\right)\right)\Gamma_2
$$
or
$$
y = a\Big(\left(\gamma\otimes\mathbf{1}_N\right)\odot \left(\mathbf{X\Gamma_1} - \mu_{\mathbf{X\Gamma_1}}\right) \odot\sigma^{-1}_{\mathbf{X\Gamma_1}} + \left(\beta\otimes\mathbf{1}_N\right)\Big)\mathbf{\Gamma_2}
$$
I have no idea how to compute the derivatives of Hadamard and Kronecker products. On the subject of Kronecker products, the literature gets fairly arcane.

Is there a practical way of computing $\partial R/\partial \gamma$, $\partial R/\partial \beta$, and $\partial R/\partial \mathbf{\Gamma_1}$ within the matrix framework? A simple expression, without resorting to node-by-node computation?

Update 1:

I've figured out $\partial R/\partial \beta$ — sort of. It is:
$$
\mathbf{1}_{N}^T \left(a'(b(\mathbf{X}\mathbf{\Gamma}_1)) \odot -2\hat\epsilon \mathbf{\Gamma}_2^T\right)
$$
Some R code demonstrates that this is equivalent to the looping way to do it. First set up the fake data:

set.seed(1)
library(dplyr)
library(foreach)

# numbers of obs, input variables, and hidden units
N <- 10
p1 <- 7
p2 <- 4
a <- function (v) {
  v[v < 0] <- 0
  v
}
ap <- function (v) {
  v[v < 0] <- 0
  v[v >= 0] <- 1
  v
}

# parameters
G1 <- matrix(rnorm(p1*p2), nrow = p1)
G2 <- rnorm(p2)
gamma <- 1:p2+1
beta <- (1:p2+1)*-1
# error term
u <- rnorm(N)

# matrix batch norm function
b <- function(x, bet = beta, gam = gamma){
  xs <- scale(x)
  gk <- t(matrix(gam)) %x% matrix(rep(1, N))
  bk <- t(matrix(bet)) %x% matrix(rep(1, N))
  gk*xs+bk
}
# activation-wise batch norm function
bi <- function(x, i){
  xs <- scale(x)
  gk <- t(matrix(gamma[i]))
  bk <- t(matrix(beta[i]))
  suppressWarnings(gk*xs[,i]+bk)
}

X <- round(runif(N*p1, -5, 5)) %>% matrix(nrow = N)
# the neural net
y <- a(b(X %*% G1)) %*% G2 + u

Then compute derivatives:

# drdbeta -- the matrix way
drdb <- matrix(rep(1, N*1), nrow = 1) %*% (-2*u %*% t(G2) * ap(b(X%*%G1)))
drdb
           [,1]      [,2]    [,3]        [,4]
[1,] -0.4460901 0.3899186 1.26758 -0.09589582
# the looping way
foreach(i = 1:4, .combine = c) %do%{
  sum(-2*u*matrix(ap(bi(X[,i, drop = FALSE]%*%G1[i,], i)))*G2[i])
}
[1] -0.44609015  0.38991862  1.26758024 -0.09589582

They match. But I'm still confused, because I don't really know why this works. The MatCalc notes referenced by @Mark L. Stone say that the derivative of $\beta \otimes \mathbf{1}_N$ should be

$$
\frac{\partial A \otimes B}{\partial A} = \left(I_{nq} \otimes T_{mp}\right)\left(I_n\otimes vec(B) \otimes I_m\right)
$$
where the subscripts $m$, $n$, and $p$, $q$ are the dimensions of $A$ and $B$. $T$ is the commutation matrix, which is just 1 here because both inputs are vectors. I try this and get a result that doesn't seem helpful:

# playing with the Kronecker derivative rule
A <- t(matrix(beta)) 
B <- matrix(rep(1, N))
diag(rep(1, ncol(A) *ncol(B))) %*% diag(rep(1, ncol(A))) %x% (B) %x% diag(nrow(A))
     [,1] [,2] [,3] [,4]
 [1,]    1    0    0    0
 [2,]    1    0    0    0
snip
[13,]    0    1    0    0
[14,]    0    1    0    0
snip
[28,]    0    0    1    0
[29,]    0    0    1    0
snip
[39,]    0    0    0    1
[40,]    0    0    0    1

This isn't conformable. Clearly I'm not understanding those Kronecker derivative rules. Help with those would be great. I'm still totally stuck on the other derivatives, for $\gamma$ and $\mathbf{\Gamma_1}$ — those are harder because they don't enter additively like $\beta \otimes \mathbf{1}$ does.
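
For the $\beta$ piece, at least, the shape can be checked directly: with $A = \beta$ (a $1\times p$ row vector) and $B = \mathbf{1}_N$, the map $\beta \mapsto \beta\otimes\mathbf{1}_N$ is linear and its Jacobian is just $I_p \otimes \mathbf{1}_N$, and chaining that against the stub gives exactly the column sums computed above. A quick check, reusing the objects already defined (the names J, numJ, and stub0 are ad hoc, just for this snippet):

# why the beta result works: d vec(beta %x% 1_N) / d beta' is I_p %x% 1_N,
# so chaining it with the stub just sums each of the stub's columns
J <- diag(p2) %x% matrix(rep(1, N))          # Np x p Jacobian of the linear map
numJ <- sapply(1:p2, function(j) {
  h <- 1e-6; bp <- beta; bp[j] <- bp[j] + h
  c((t(matrix(bp)) %x% matrix(rep(1, N)) - t(matrix(beta)) %x% matrix(rep(1, N))) / h)
})
max(abs(J - numJ))                           # ~0, since the map is linear in beta
stub0 <- -2*u %*% t(G2) * ap(b(X %*% G1))
cbind(t(J) %*% c(stub0), colSums(stub0))     # identical columns; same values as drdb above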

Update 2

Reading textbooks, I'm fairly sure that $\partial R/\partial \Gamma_1$ and $\partial R/\partial \gamma$ will require use of the vec() operator. But I'm apparently unable to follow the derivations sufficiently as to be able to translate them into code. For example, $\partial R/\partial \Gamma_1$ is going to involve taking the derivative of $w\odot\mathbf{X\Gamma_1}$ with respect to $\mathbf{\Gamma_1}$, where $w \equiv (\gamma \otimes \mathbf{1}) \odot \sigma_{\mathbf{X\Gamma_1}}^{-1}$ (which we can treat as a constant matrix for the moment).

My instinct is to simply say "the answer is $w\odot\mathbf{X}$", but that obviously doesn't work because $w$ isn't conformable with $\mathbf{X}$.

I know that
$$
\partial(A \odot B) = \partial A \odot B + A \odot \partial B
$$

and from this, that

$$
\frac{\partial\, vec(w \odot \mathbf{X\Gamma_1})}{\partial\, vec(\mathbf{\Gamma_1})^T} = \mathrm{diag}\left(vec(\mathbf{X\Gamma_1})\right)\frac{\partial\, vec(w)}{\partial\, vec(\mathbf{\Gamma_1})^T} + \mathrm{diag}\left(vec(w)\right)\frac{\partial\, vec(\mathbf{X\Gamma_1})}{\partial\, vec(\mathbf{\Gamma_1})^T}
$$
But I'm uncertain how to evaluate this, let alone code it.
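
One concrete piece, at least: for conformable matrices, $vec(A \odot B) = \mathrm{diag}\left(vec(A)\right)vec(B)$, which is how the $vec(\cdot)$ factors above act as ordinary matrices. A toy check (throwaway names, not part of the setup above):

# the identity behind the diag(vec(.)) terms above: vec(A o B) = diag(vec(A)) %*% vec(B)
A0 <- matrix(rnorm(6), 2)
B0 <- matrix(rnorm(6), 2)
all.equal(c(A0 * B0), c(diag(c(A0)) %*% c(B0)))   # TRUE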

Update 3

Making progress here. I woke up at 2AM last night with this idea. Math is not good for sleep.

Here is $\partial R/\partial \mathbf{\Gamma_1}$, after some notational sugar:

  • $w \equiv (\gamma \otimes \mathbf{1}) \odot \sigma_{\mathbf{X\Gamma_1}}^{-1}$
  • $\text{"stub"} \equiv a'(b(\mathbf{X\Gamma}_1)) \odot -2\hat\epsilon \mathbf{\Gamma}_2^T$

Here's what you have after you get to the end of the chain rule:
$$
\frac{\partial R}{\partial \Gamma_1} = \frac{\partial w \odot \mathbf{X\Gamma}_1}{\partial \Gamma_1}\left(\text{"stub"}\right)
$$
Start by doing this the looping way, where $i$ and $j$ subscript the columns of $\mathbf{X}$ and of the "stub", respectively, and $\mathrm{diag}(\cdot)$ places a vector on the diagonal of a matrix:
$$
\frac{\partial R}{\partial \Gamma_{ij}} = \left(w_j \odot \mathbf{X_i}\right)^T\left(\text{"stub"}_j\right)
$$
$$
\frac{\partial R}{\partial \Gamma_{ij}} = \left(\mathrm{diag}(w_j)\, \mathbf{X_i}\right)^T\left(\text{"stub"}_j\right)
$$
$$
\frac{\partial R}{\partial \Gamma_{ij}} = \mathbf{X_i}^T\,\mathrm{diag}(w_j)\left(\text{"stub"}_j\right)
$$
tl;dr: you're basically pre-multiplying the stub by the batchnorm scale factors. This should be equivalent to:
$$
\frac{\partial R}{\partial \Gamma_1} = \mathbf{X}^T\left(\text{"stub"}\odot w\right)
$$

And, in fact it is:

stub <- (-2*u %*% t(G2) * ap(b(X%*%G1)))
w <- t(matrix(gamma)) %x% matrix(rep(1, N)) * (apply(X%*%G1, 2, sd) %>% t %x% matrix(rep(1, N)))
drdG1 <- t(X) %*% (stub*w)

loop_drdG1 <- drdG1*NA
for (i in 1:7){
  for (j in 1:4){
    loop_drdG1[i,j] <- t(X[,i]) %*% diag(w[,j]) %*% (stub[,j])
  }
}

> loop_drdG1
           [,1]       [,2]       [,3]       [,4]
[1,] -61.531877  122.66157  360.08132 -51.666215
[2,]   7.047767  -14.04947  -41.24316   5.917769
[3,] 124.157678 -247.50384 -726.56422 104.250961
[4,]  44.151682  -88.01478 -258.37333  37.072659
[5,]  22.478082  -44.80924 -131.54056  18.874078
[6,]  22.098857  -44.05327 -129.32135  18.555655
[7,]  79.617345 -158.71430 -465.91653  66.851965
> drdG1
           [,1]       [,2]       [,3]       [,4]
[1,] -61.531877  122.66157  360.08132 -51.666215
[2,]   7.047767  -14.04947  -41.24316   5.917769
[3,] 124.157678 -247.50384 -726.56422 104.250961
[4,]  44.151682  -88.01478 -258.37333  37.072659
[5,]  22.478082  -44.80924 -131.54056  18.874078
[6,]  22.098857  -44.05327 -129.32135  18.555655
[7,]  79.617345 -158.71430 -465.91653  66.851965

Update 4

Here, I think, is $\partial R / \partial \gamma$. First

  • $\widetilde{\mathbf{X\Gamma}} \equiv \left(\mathbf{X\Gamma} - \mu_{\mathbf{X\Gamma}}\right)\odot \sigma^{-1}_\mathbf{X\Gamma}$
  • $\tilde\gamma \equiv \gamma \otimes\mathbf{1}_N$

Similar to before, the chain rule gets you as far as
$$
\frac{\partial R}{\partial \tilde\gamma} = \frac{\partial \tilde\gamma \odot \widetilde{\mathbf{X\Gamma}}}{\partial \tilde\gamma}\left(\text{"stub"}\right)
$$
Looping gives you
$$
\frac{\partial R}{\partial \tilde\gamma_i} = (\widetilde{\mathbf{X\Gamma}})_i^T\, \mathrm{diag}(\tilde\gamma_i) \left(\text{"stub"}_i\right)
$$
Which, like before, is basically pre-multiplying the stub. It should therefore be equivalent to:
$$
\frac{\partial R}{\partial \tilde\gamma} = (\widetilde{\mathbf{X\Gamma}})^T \left(\text{"stub"} \odot \tilde\gamma \right)
$$

It sort of matches:

drdg <- t(scale(X %*% G1)) %*% (stub * t(matrix(gamma)) %x% matrix(rep(1, N)))

loop_drdg <- foreach(i = 1:4, .combine = c) %do% {
  t(scale(X %*% G1)[,i]) %*% (stub[,i, drop = F] * gamma[i])  
}

> drdg
           [,1]      [,2]       [,3]       [,4]
[1,]  0.8580574 -1.125017  -4.876398  0.4611406
[2,] -4.5463304  5.960787  25.837103 -2.4433071
[3,]  2.0706860 -2.714919 -11.767849  1.1128364
[4,] -8.5641868 11.228681  48.670853 -4.6025996
> loop_drdg
[1]   0.8580574   5.9607870 -11.7678486  -4.6025996

The diagonal of the first is the same as the vector from the second. But since the derivative is taken with respect to a matrix (albeit one with a particular structure), shouldn't the result be a matrix with the same structure? Should I just take the diagonal of the matrix approach and use it as $\partial R/\partial \gamma$? I'm not sure.

It seems that I have answered my own question, but I am unsure whether I am correct. At this point I will accept an answer that rigorously proves (or disproves) what I've sort of hacked together.

while(not_answered){
  print("Bueller?")
  Sys.sleep(1)
}

Best Answer

Not a complete answer, but to demonstrate what I suggested in my comment: if
$$
b(X)=(X-e_N\mu_X^T)\Gamma\Sigma_X^{-1/2}+e_N\beta^T
$$
where $\Gamma=\mathop{\mathrm{diag}}(\gamma)$, $\Sigma_X^{-1/2}=\mathop{\mathrm{diag}}(\sigma_{X_1}^{-1},\sigma_{X_2}^{-1},\dots)$ and $e_N$ is a vector of ones, then by the chain rule
$$
\nabla_\beta R=[-2\hat{\epsilon}(\Gamma_2^T\otimes I)J_X(a)(I\otimes e_N)]^T
$$
Noting that $-2\hat{\epsilon}(\Gamma_2^T\otimes I)=\mathop{\mathrm{vec}}(-2\hat{\epsilon}\Gamma_2^T)^T$ and $J_X(a)=\mathop{\mathrm{diag}}(\mathop{\mathrm{vec}}(a^\prime(b(X\Gamma_1))))$, we see that
$$
\nabla_\beta R=(I\otimes e_N^T)\mathop{\mathrm{vec}}(a^\prime(b(X\Gamma_1))\odot-2\hat{\epsilon}\Gamma_2^T)=e_N^T(a^\prime(b(X\Gamma_1))\odot-2\hat{\epsilon}\Gamma_2^T)
$$
via the identity $\mathop{\mathrm{vec}}(AXB)=(B^T\otimes A)\mathop{\mathrm{vec}}(X)$.

Similarly,
$$
\begin{align}
\nabla_\gamma R&=[-2\hat{\epsilon}(\Gamma_2^T\otimes I)J_X(a)(\Sigma_{X\Gamma_1}^{-1/2}\otimes (X\Gamma_1-e_N\mu_{X\Gamma_1}^T))K]^T\\
&=K^T\mathop{\mathrm{vec}}((X\Gamma_1-e_N\mu_{X\Gamma_1}^T)^TW\Sigma^{-1/2}_{X\Gamma_1})\\
&=\mathop{\mathrm{diag}}((X\Gamma_1-e_N\mu_{X\Gamma_1}^T)^TW\Sigma^{-1/2}_{X\Gamma_1})
\end{align}
$$
where $W=a^\prime(b(X\Gamma_1))\odot-2\hat{\epsilon}\Gamma_2^T$ (the "stub") and $K$ is an $Np\times p$ binary matrix that selects the columns of the Kronecker product corresponding to the diagonal elements of a square matrix. This follows from the fact that $d\Gamma_{i\neq j}=0$.

Unlike the first gradient, this expression is not equivalent to the one you derived: since $b$ is linear in $\gamma_i$, there should be no factor of $\gamma_i$ in the gradient. I leave the gradient of $\Gamma_1$ to the OP, but I will say that deriving it with fixed $w$ creates the "explosion" the authors of the article seek to avoid. In practice, you will also need the Jacobians of $\Sigma_X$ and $\mu_X$ with respect to $X$, and to use the product rule.
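
For what it's worth, both expressions can be checked numerically against the OP's R setup above, since for $\beta$ and $\gamma$ the batch statistics do not depend on the parameter being differentiated. A rough sketch reusing the OP's X, G1, G2, gamma, beta, u, a, and b (the helpers Rloss, ap_, and num_grad are ad hoc; ap_ is simply the ReLU derivative, 1 where the argument is positive):

# finite-difference check of the two gradients above, reusing the OP's objects
Rloss <- function(gam, bet) sum((y - a(b(X %*% G1, bet = bet, gam = gam)) %*% G2)^2)
ap_   <- function(v) (v > 0) * 1                      # derivative of the ReLU a()

W <- -2*u %*% t(G2) * ap_(b(X %*% G1))                # the "stub"
grad_beta  <- colSums(W)                              # e_N' W
grad_gamma <- diag(t(scale(X %*% G1)) %*% W)          # diag((XG1 - e mu')' W Sigma^(-1/2))

num_grad <- function(f, x, h = 1e-6) sapply(seq_along(x), function(i) {
  xp <- x; xm <- x; xp[i] <- xp[i] + h; xm[i] <- xm[i] - h
  (f(xp) - f(xm)) / (2*h)
})
rbind(analytic = grad_beta,  numeric = num_grad(function(v) Rloss(gamma, v), beta))
rbind(analytic = grad_gamma, numeric = num_grad(function(v) Rloss(v, beta), gamma))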