[Math] Numerator layout for derivatives and the chain rule

matricesmatrix-calculus

We have three matrices $\mathbf{W_2}$, $\mathbf{W_1}$ and $\mathbf{h}$ (technically a column vector):

$$
\mathbf{W_1} =
\begin{bmatrix}
a & b \\
c & d \\
\end{bmatrix}
\;\;\;\;\;\;\;\;\;
\mathbf{W_2} =
\begin{bmatrix}
e & f \\
\end{bmatrix}
\;\;\;\;\;\;\;\;\;
\mathbf{h} =
\begin{bmatrix}
h_1 \\
h_2 \\
\end{bmatrix}
$$

And a scalar $y$, where:

$$
y = \mathbf{W_2} \mathbf{W_1} \mathbf{h}
$$

I'd like to compute the derivative of $y$ with respect to $\mathbf{W_1}$, assuming numerator layout.

Using the chain rule:

$$
y = \mathbf{W_2} \mathbf{u}
\;\;\;\;\;\;\;\;\;
\mathbf{u} = \mathbf{W_1} \mathbf{h} \\
$$

$$
\begin{align}
\frac{\partial y}{\partial \mathbf{W_1}} &=
\frac{\partial y}{\partial \mathbf{u}} \frac{\partial \mathbf{u}}{\partial \mathbf{W_1}} \\
&= \mathbf{W_2} \frac{\partial \mathbf{u}}{\partial \mathbf{W_1}} \\
&= \mathbf{W_2} \mathbf{h}^{\top} \\
\end{align}
$$

All well and good. Except – this isn't a $2×2$ matrix!! In fact, the dimensions don't match up for matrix multiplication, so something must be incorrect.

If we take the Wikipedia definition of the derivative of a scalar by a matrix, using numerator layout, we know that actually:

$$
\frac{\partial y}{\partial \mathbf{W_1}} =
\begin{bmatrix}
\frac{\partial y}{\partial a} & \frac{\partial y}{\partial c} \\
\frac{\partial y}{\partial b} & \frac{\partial y}{\partial d} \\
\end{bmatrix}
$$

Each element is just a scalar derivative, which we can calculate without any matric calculus. If we do that by hand and then factorise, we end up with:

$$
\frac{\partial y}{\partial \mathbf{W_1}} = \mathbf{h} \mathbf{W_2}
$$

Clearly, $\mathbf{h} \mathbf{W_2} \neq \mathbf{W_2} \mathbf{h}^\top $.

Can anybody suggest where I went wrong?

Best Answer

$\def\p#1#2{\frac{\partial #1}{\partial #2}}$ Define the trace/Frobenius product as $$A:B \;=\; {\rm Tr}(A^TB) \;=\; \sum_{i=1}^m\sum_{j=1}^n A_{ij}B_{ij}$$ Using this product eliminates a whole category of transposition errors, which arise in other approaches.

The properties of the trace allow such products to be arranges in many equivalent ways, e.g. $$\eqalign{ A:C &= C:A &= C^T:A^T \\ AB:C &= A:CB^T &= B:A^TC \\ }$$ Note that the matrices on the LHS and RHS of the colon must have the same dimensions.
The Frobenius product is similar to the Hadamard product in this respect.

Let's define some variables without the distracting subscripts $$W = W_1, \qquad w=W_2^T$$ Write the scalar function in terms of these new variables. Then calculate its differential and gradient. $$\eqalign{ y &= w:Wh \\&= wh^T:W \\ dy &= wh^T:dW \\ \p{y}{W} &= wh^T \\ }$$ The dimensions of this result equal the dimensions of the $W$ matrix, expressed as the outer product of two column vectors.

Writing this in terms of the original variables $$\eqalign{ \p{y}{W_1} &= W_2^Th^T \\ }$$

Related Solutions

[Math] Using Chain Rule in Matrix Differentiation

Rather than the chain rule, let's tackle the problem using differentials.

Let's use the convention that an upppercase letter is a matrix, lowercase is a column vector, and a greek letter is a scalar. Now let's define some variables and their differentials $$\eqalign{ z &= W_1x+b_1 &\implies dz=dW_1\,x \cr h &= {\rm relu}(z) &\implies dh={\rm step}(z)\odot dz = s\odot dz \cr }$$ where ${\rm step}(z)$ is the Heaviside step function. Both the relu and step functions are applied elementwise to their arguments.

Now take the differential and gradient of the function. $$\eqalign{ \phi &= w_2^Th + \beta_2 \cr &= w_2:h + \beta_2 \cr \cr d\phi &= w_2:dh \cr &= w_2:(s\odot dz) \cr &= (s\odot w_2):dz \cr &= (s\odot w_2):dW_1\,x \cr &= (s\odot w_2)x^T:dW_1 \cr \cr \frac{\partial\phi}{\partial W_1} &= (w_2\odot s)x^T \cr &= \Big(w_2\odot{\rm step}(W_1x+b_1)\Big)x^T \cr\cr }$$ In the above, I used the notations $$\eqalign{ &A:B = {\rm tr}(A^TB) \cr &A\odot B \cr }$$ for the trace/Frobenius and elementwise/Hadamard products, respectively.

Batch gradient descent – matrix operations

OK, I managed to find it out, after making my python code work:

import numpy as np

#sigmoid function
def sigmoid(x):
  return 1/(1+np.exp(-x))

#derivative of sigmoid function
def sigmoid_deriv(x):
  return x * (1 - x)

#MSE function
def compute_cost(y,o):
  return np.average(np.power(y - o, 2))

#prepend ones for multiplication with biases b1 b2 b3
def prepend_ones(X):
  return np.hstack([np.ones((X.shape[0],1)), X])

#init all biases to 0 and all weights to 1
def initialize_model():
  W1 = np.array([
    [0,0], # b1 b2
    [1,1], # w1 w3
    [1,1]  # w2 w4
  ], dtype=np.float64).T

  W2 = np.array([
    [0], # b3
    [1], # w5
    [1]  # w6
  ], dtype=np.float64).T
  return W1.T,W2.T

def forward_prop(x, W1, W2):
  h = prepend_ones(x).dot(W1)
  h = sigmoid(h)
  o = prepend_ones(h).dot(W2)
  o = sigmoid(o)
  return o, h

def back_prop(x,y,o,h,W1,W2):
  E = y - o
  dZ = -2*E*sigmoid_deriv(o)
  Wz = W2[1:].T
  dH = dZ.dot(Wz) * sigmoid_deriv(h)
  grad_W1 = prepend_ones(x).T.dot(dH)
  grad_W2 = prepend_ones(h).T.dot(dZ)
  return grad_W1, grad_W2

def get_batch():
  data = np.array([
    [-2, -1],
    [25, 6],
    [17, 4],
    [-15, -6],
  ])
  labels = np.array([
    [1],
    [0],
    [0],
    [1],
  ])
  return data, labels

def gradient_descent(alpha=0.1):
  W1, W2 = initialize_model()
  for _ in range(0, 1000): #1000 epochs
      x,y = get_batch() #batch is always the same - all input samples
      o,h = forward_prop(x,W1,W2)
      cost = compute_cost(y,o)
      print(f"cost is {cost}")
      grad_W1, grad_W2 = back_prop(x,y,o,h,W1,W2)
      W1 = W1 - alpha*grad_W1
      W2 = W2 - alpha*grad_W2
  return W1, W2

W1,W2 = gradient_descent()

Seems that after all that I can use SUM(gradients) and not AVG(gradients) - it will converge in the end to the same thing? At least the result I got in the end is the same.

And this is the calc. I had to do to have all in place:

$$ \begin{bmatrix} \Sigma(\frac{\partial MSE_{x_f}}{\partial b_1}) & \Sigma(\frac{\partial MSE_{x_f}}{\partial b_2})\\ \Sigma(\frac{\partial MSE_{x_f}}{\partial w_1}) & \Sigma(\frac{\partial MSE_{x_f}}{\partial w_3})\\ \Sigma(\frac{\partial MSE_{x_f}}{\partial w_2}) & \Sigma(\frac{\partial MSE_{x_f}}{\partial w_4}) \end{bmatrix} =X^T\times\left(\left(\left(-2 \begin{bmatrix} y_{true_{1}}-S(o_1)\\ y_{true_{2}}-S(o_2)\\ y_{true_{3}}-S(o_3)\\ \end{bmatrix} \begin{bmatrix} (1-S(o_1))S(o_1)\\ (1-S(o_2))S(o_2)\\ (1-S(o_3))S(o_3)\\ \end{bmatrix} \right)\times \begin{bmatrix} w_5 & w_6\\ \end{bmatrix} \right) \begin{bmatrix} (1-S(h_{11}))S(h_{11}) & (1-S(h_{12}))S(h_{12})\\ (1-S(h_{21}))S(h_{21}) & (1-S(h_{22}))S(h_{22})\\ (1-S(h_{31}))S(h_{31}) & (1-S(h_{32}))S(h_{32})\\ \end{bmatrix} \right) $$

The biggest problem was Wz = W2[1:].T where I had to remove $b_3$ element from matrix - that is to get $\begin{bmatrix} w_5 & w_6\\ \end{bmatrix}$ in order for all operations to complete. I guess if we have more hidden layers, we would need to remove bias term each time we go in back-propagation phase.

So still have to find out what is the difference between SUM and AVG of gradients - and is there some way what I do not have to remove bias terms from matrices when going in back-prop?

Best Answer

Related Solutions

[Math] Using Chain Rule in Matrix Differentiation

Batch gradient descent – matrix operations

Related Question