Rather than the chain rule, let's tackle the problem using differentials.
Let's use the convention that an uppercase letter is a matrix, a lowercase letter is a column vector, and a Greek letter is a scalar. Now let's define some variables and their differentials:
$$\eqalign{
z &= W_1x+b_1 &\implies dz=dW_1\,x \cr
h &= {\rm relu}(z) &\implies dh={\rm step}(z)\odot dz = s\odot dz \cr
}$$ where ${\rm step}(z)$ is the Heaviside step function, and only $W_1$ is being varied (so $x$ and $b_1$ are treated as constants). Both the relu and step functions are applied elementwise to their arguments.
Now take the differential and gradient of the function.
$$\eqalign{
\phi &= w_2^Th + \beta_2 \cr
&= w_2:h + \beta_2 \cr
\cr
d\phi
&= w_2:dh \cr
&= w_2:(s\odot dz) \cr
&= (s\odot w_2):dz \cr
&= (s\odot w_2):dW_1\,x \cr
&= (s\odot w_2)x^T:dW_1 \cr
\cr
\frac{\partial\phi}{\partial W_1}
&= (w_2\odot s)x^T \cr
&= \Big(w_2\odot{\rm step}(W_1x+b_1)\Big)x^T \cr\cr
}$$
In the above, I used the notations
$$\eqalign{
&A:B = {\rm tr}(A^TB) \cr
&A\odot B \cr
}$$
for the trace/Frobenius and elementwise/Hadamard products, respectively.
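As a quick numerical sanity check of this result (my own sketch, not part of the derivation; the shapes and values are arbitrary), a finite-difference approximation matches the formula $(w_2\odot s)x^T$:

import numpy as np

# finite-difference check of d(phi)/dW1 = (w2 ⊙ step(W1 x + b1)) x^T
rng = np.random.default_rng(0)
W1 = rng.normal(size=(3, 2))
b1 = rng.normal(size=3)
w2 = rng.normal(size=3)
beta2 = rng.normal()
x = rng.normal(size=2)

def phi(W):
    h = np.maximum(W @ x + b1, 0)        # relu, elementwise
    return w2 @ h + beta2

s = (W1 @ x + b1 > 0).astype(float)      # step(z), elementwise
analytic = np.outer(w2 * s, x)           # (w2 ⊙ s) x^T

eps = 1e-6
numeric = np.zeros_like(W1)
for i in range(3):
    for j in range(2):
        dW = np.zeros_like(W1)
        dW[i, j] = eps
        numeric[i, j] = (phi(W1 + dW) - phi(W1 - dW)) / (2 * eps)

print(np.allclose(analytic, numeric))    # True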
OK, I managed to figure it out after getting my Python code to work:
import numpy as np

# sigmoid function
def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# derivative of the sigmoid, expressed in terms of the sigmoid's output
def sigmoid_deriv(x):
    return x * (1 - x)

# MSE function
def compute_cost(y, o):
    return np.average(np.power(y - o, 2))

# prepend ones for multiplication with the biases b1 b2 b3
def prepend_ones(X):
    return np.hstack([np.ones((X.shape[0], 1)), X])

# init all biases to 0 and all weights to 1
def initialize_model():
    W1 = np.array([
        [0, 0],  # b1 b2
        [1, 1],  # w1 w3
        [1, 1],  # w2 w4
    ], dtype=np.float64)
    W2 = np.array([
        [0],  # b3
        [1],  # w5
        [1],  # w6
    ], dtype=np.float64)
    return W1, W2

def forward_prop(x, W1, W2):
    h = prepend_ones(x).dot(W1)
    h = sigmoid(h)
    o = prepend_ones(h).dot(W2)
    o = sigmoid(o)
    return o, h

def back_prop(x, y, o, h, W1, W2):
    E = y - o
    dZ = -2 * E * sigmoid_deriv(o)
    Wz = W2[1:].T  # drop the bias row b3 so only w5, w6 propagate back
    dH = dZ.dot(Wz) * sigmoid_deriv(h)
    grad_W1 = prepend_ones(x).T.dot(dH)
    grad_W2 = prepend_ones(h).T.dot(dZ)
    return grad_W1, grad_W2

def get_batch():
    data = np.array([
        [-2, -1],
        [25, 6],
        [17, 4],
        [-15, -6],
    ])
    labels = np.array([
        [1],
        [0],
        [0],
        [1],
    ])
    return data, labels

def gradient_descent(alpha=0.1):
    W1, W2 = initialize_model()
    for _ in range(1000):  # 1000 epochs
        x, y = get_batch()  # the batch is always the same - all input samples
        o, h = forward_prop(x, W1, W2)
        cost = compute_cost(y, o)
        print(f"cost is {cost}")
        grad_W1, grad_W2 = back_prop(x, y, o, h, W1, W2)
        W1 = W1 - alpha * grad_W1
        W2 = W2 - alpha * grad_W2
    return W1, W2

W1, W2 = gradient_descent()
It seems that after all this I can use the SUM of the gradients rather than the AVG - it will converge to the same thing in the end? At least the final result I got is the same.
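To see what the code is actually computing (a quick sketch reusing the functions above): back_prop returns the SUM of the per-sample gradients, and the AVG is just that divided by the batch size $n$ - a constant factor that the learning rate can absorb.

# verify that back_prop sums the per-sample gradients rather than averaging
W1, W2 = initialize_model()
x, y = get_batch()
o, h = forward_prop(x, W1, W2)
batch_grad_W1, _ = back_prop(x, y, o, h, W1, W2)

per_sample = [back_prop(x[i:i+1], y[i:i+1], o[i:i+1], h[i:i+1], W1, W2)[0]
              for i in range(x.shape[0])]
print(np.allclose(batch_grad_W1, np.sum(per_sample, axis=0)))  # True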
And this is the calculation I had to do to put everything in place:
$$ \begin{bmatrix}
\Sigma(\frac{\partial MSE_{x_f}}{\partial b_1}) & \Sigma(\frac{\partial MSE_{x_f}}{\partial b_2})\\
\Sigma(\frac{\partial MSE_{x_f}}{\partial w_1}) & \Sigma(\frac{\partial MSE_{x_f}}{\partial w_3})\\
\Sigma(\frac{\partial MSE_{x_f}}{\partial w_2}) & \Sigma(\frac{\partial MSE_{x_f}}{\partial w_4})
\end{bmatrix}
=X^T\times\left(\left(\left(-2
\begin{bmatrix}
y_{true_{1}}-S(o_1)\\
y_{true_{2}}-S(o_2)\\
y_{true_{3}}-S(o_3)\\
\end{bmatrix}
\odot
\begin{bmatrix}
(1-S(o_1))S(o_1)\\
(1-S(o_2))S(o_2)\\
(1-S(o_3))S(o_3)\\
\end{bmatrix}
\right)\times
\begin{bmatrix}
w_5 & w_6\\
\end{bmatrix}
\right)
\odot
\begin{bmatrix}
(1-S(h_{11}))S(h_{11}) & (1-S(h_{12}))S(h_{12})\\
(1-S(h_{21}))S(h_{21}) & (1-S(h_{22}))S(h_{22})\\
(1-S(h_{31}))S(h_{31}) & (1-S(h_{32}))S(h_{32})\\
\end{bmatrix}
\right)
$$
Here $S$ is the sigmoid, $\odot$ is the elementwise product, and $X$ already has the column of ones prepended (which is what produces the $b_1$, $b_2$ row).
The biggest problem was Wz = W2[1:].T, where I had to remove the $b_3$ element from the matrix - that is, to get $\begin{bmatrix}
w_5 & w_6\\
\end{bmatrix}$ - in order for all the operations to work out. I guess that with more hidden layers we would need to remove the bias term each time we step back a layer during the back-propagation phase.
So I still have to find out what the difference between the SUM and the AVG of the gradients is - and is there some way to avoid removing the bias terms from the matrices during back-prop? (One possibility is sketched below.)
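On that second question: one common way to avoid the slicing altogether (this is a sketch of that convention, not part of the code above) is to store the weights and biases in separate arrays; then the backward pass multiplies by plain $W^T$ and the bias gradient is simply the column sum of the deltas.

# sketch: with separate W and b, no bias row ever has to be removed
def forward(x, W1, b1, W2, b2):
    h = sigmoid(x.dot(W1) + b1)
    o = sigmoid(h.dot(W2) + b2)
    return o, h

def backward(x, y, o, h, W2):
    dZ = -2 * (y - o) * sigmoid_deriv(o)
    dH = dZ.dot(W2.T) * sigmoid_deriv(h)   # plain W2.T - nothing to slice off
    grad_W1, grad_b1 = x.T.dot(dH), dH.sum(axis=0)
    grad_W2, grad_b2 = h.T.dot(dZ), dZ.sum(axis=0)
    return grad_W1, grad_b1, grad_W2, grad_b2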
Best Answer
It's pretty quick to use the chain rule here. Note that $$ \frac{\partial }{\partial x} \|x\|^2 = 2 x^T. $$ With that, $$ \frac{\partial }{\partial b_1}\|x - (W_2(W_1x+b_1)+b_2)\|^2 \\= 2[x - (W_2(W_1x+b_1)+b_2)]^T \frac{\partial }{\partial b_1} [x - (W_2(W_1x+b_1)+b_2)] \\ = 2[x - (W_2(W_1x+b_1)+b_2)]^T (-W_2). $$
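This identity is easy to verify numerically (my own sketch; the shapes below are arbitrary):

import numpy as np

# finite-difference check of d/db1 ||x - (W2 (W1 x + b1) + b2)||^2 = -2 r^T W2
rng = np.random.default_rng(1)
W1 = rng.normal(size=(3, 4))
b1 = rng.normal(size=3)
W2 = rng.normal(size=(4, 3))
b2 = rng.normal(size=4)
x = rng.normal(size=4)

def f(b):
    r = x - (W2 @ (W1 @ x + b) + b2)   # residual
    return r @ r                       # squared norm

r = x - (W2 @ (W1 @ x + b1) + b2)
analytic = -2 * r @ W2                 # the row vector -2 r^T W2

eps = 1e-6
numeric = np.array([(f(b1 + eps * e) - f(b1 - eps * e)) / (2 * eps)
                    for e in np.eye(3)])
print(np.allclose(analytic, numeric))  # True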