Batch gradient descent – matrix operations

Tags: gradient descent, matrices, neural networks, partial derivative

I was trying to implement a simple feedforward neural network (FNN) using the tutorial on this page.

So, given this super-simple network:

[Network diagram: inputs $x_1$, $x_2$; two hidden neurons $h_1$, $h_2$ with weights $w_1,\dots,w_4$ and biases $b_1$, $b_2$; one output neuron $o_1$ with weights $w_5$, $w_6$ and bias $b_3$]

and I use these 3 samples for training:

$$
\begin{array} {|r|r|r|}
\hline
\text{weight (feature 1} = x_1\text{)} & \text{height (feature 2} = x_2\text{)} & \text{gender (label} = y_{true}\text{)} \\
\hline
-2 & -1 & 1 \\
\hline
25 & 6 & 0 \\
\hline
17 & 4 & 0 \\
\hline
\end{array}
$$

I wanted to define all of the matrix operations needed for forward and backward propagation.

The standard activation function used is the sigmoid $S(x)=\frac{1}{1+e^{-x}}$ (in both the hidden and the output layer), and I managed to find its derivative:

$$ S'(x)=\frac{(1+e^{-x})(1)'-(1)(1+e^{-x})'}{(1+e^{-x})^2}=-\frac{-e^{-x}}{(1+e^{-x})^2}=\frac{e^{-x}}{(1+e^{-x})^2}=\frac{e^{-x}}{1+e^{-x}}\frac{1}{1+e^{-x}}=
\left(\frac{1+e^{-x}-1}{1+e^{-x}}\right)\frac{1}{1+e^{-x}}=\left(\frac{1+e^{-x}}{1+e^{-x}}-\frac{1}{1+e^{-x}}\right)\frac{1}{1+e^{-x}}=\left(1-S(x)\right)S(x) $$

where

$$ (1)'=0 $$
$$ (1+e^{-x})'=(1)'+(e^{-x})'=0+(e^{-x})'=-e^{-x} $$
$$ (e^{-x})'=(\frac{1}{e^x})'=\frac{(e^x)(1)'-(1)(e^x)'}{(e^x)^2}=\frac{(e^x)(0)-(1)(e^x)}{e^{2x}}=\frac{-e^x}{e^xe^x}=\frac{-1}{e^x}=-e^{-x} $$
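Not part of the original derivation, but as a quick sanity check of the identity $S'(x)=S(x)(1-S(x))$, one can compare it against a central finite difference in NumPy:

import numpy as np

def sigmoid(x):
  return 1/(1+np.exp(-x))

#closed-form derivative from the derivation above: S'(x) = S(x)(1 - S(x))
def sigmoid_prime(x):
  s = sigmoid(x)
  return s * (1 - s)

x = np.linspace(-5, 5, 11)
eps = 1e-6
#central finite-difference approximation of S'(x)
numeric = (sigmoid(x + eps) - sigmoid(x - eps)) / (2 * eps)
print(np.max(np.abs(numeric - sigmoid_prime(x))))  #should be close to zero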

The cost function used is also the standard mean squared error $MSE=\frac{1}{N}\sum_{i=1}^{N}(y_{true}-y_{pred})^2$, where in our case $N=3$, since that is how many training samples we have.

In order to calculate the MSE, we have $y_{true}$, which is already given in the training data as the gender label. So we only need to calculate $y_{pred}$ for each of the 3 input samples. $y_{pred}$ for the first sample is calculated as:

$$ y_{pred}=\frac{1}{1+e^{-(b_3+\frac{1}{1+e^{-(b_1-2w_1-1w_2)}}w_5+\frac{1}{1+e^{-(b_2-2w_3-1w_4)}}w_6)}} $$

To calculate $y_{pred}$ for all samples at once we can use matrix operations (here we prepend a column of ones to the training samples to multiply with the bias terms, so that everything is contained in one matrix – this is the matrix $X$ below):

$$
h = X \times W_1 = \begin{bmatrix}
1 & -2 & -1\\
1 & 25 & 6\\
1 & 17 & 4
\end{bmatrix} \times
\begin{bmatrix}
b_1 & b_2\\
w_1 & w_3\\
w_2 & w_4
\end{bmatrix} =
\begin{bmatrix}
1b_1-2w_1-1w_2 & 1b_2-2w_3-1w_4\\
1b_1+25w_1+6w_2 & 1b_2+25w_3+6w_4\\
1b_1+17w_1+4w_2 & 1b_2+17w_3+4w_4
\end{bmatrix}
$$

After applying sigmoid activation function to matrix $h$ we get:

$$ S(h) = \begin{bmatrix}
S(h_{1,1}) & S(h_{1,2})\\
S(h_{2,1}) & S(h_{2,2})\\
S(h_{3,1}) & S(h_{3,2})
\end{bmatrix}
$$
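As a minimal NumPy sketch of this step (the weight and bias values below are arbitrary placeholders, not values given in the post):

import numpy as np

def sigmoid(x):
  return 1/(1+np.exp(-x))

#X: a ones column for the biases, then the two features of each sample
X = np.array([
  [1, -2, -1],
  [1, 25,  6],
  [1, 17,  4]
], dtype=np.float64)

#W1 rows: [b1, b2], [w1, w3], [w2, w4] -- arbitrary example values
W1 = np.array([
  [0.0, 0.0],
  [0.1, 0.2],
  [0.3, 0.4]
])

h = X.dot(W1)     #3x2 matrix of hidden pre-activations
S_h = sigmoid(h)  #3x2 matrix S(h)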

In order to get the matrix $o$ we again have to prepend a column of ones to $S(h)$ (for the bias multiplication) – this is the matrix $H$ below:

$$o= H \times W_2 =
\begin{bmatrix}
1 & S(h_{1,1}) & S(h_{1,2})\\
1 & S(h_{2,1}) & S(h_{2,2})\\
1 & S(h_{3,1}) & S(h_{3,2})
\end{bmatrix}
\times
\begin{bmatrix}
b_3\\
w_5\\
w_6
\end{bmatrix} =
\begin{bmatrix}
1b_3+S(h_{1,1})w_5+S(h_{1,2})w_6\\
1b_3+S(h_{2,1})w_5+S(h_{2,2})w_6\\
1b_3+S(h_{3,1})w_5+S(h_{3,2})w_6
\end{bmatrix}
$$

And finally, after applying the sigmoid activation function to our $o$ matrix, we get the matrix of all $y_{pred}$ values:

$$y_{pred}=S(o)=
\begin{bmatrix}
S(o_1)\\
S(o_2)\\
S(o_3)
\end{bmatrix}=
\begin{bmatrix}
y_{pred_1}\\
y_{pred_2}\\
y_{pred_3}
\end{bmatrix}
$$
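Continuing the sketch above (again with arbitrary placeholder values for $W_2$), the output layer follows the same pattern:

#H: prepend a ones column to S(h) for the bias b3
H = np.hstack([np.ones((S_h.shape[0], 1)), S_h])

#W2 rows: [b3], [w5], [w6] -- arbitrary example values
W2 = np.array([
  [0.0],
  [0.5],
  [0.6]
])

o = H.dot(W2)        #3x1 matrix of output pre-activations
y_pred = sigmoid(o)  #3x1 column of predictions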

where $h$ is the hidden layer and $o$ is the output layer.

Now we can easily calculate MSE as

$$ MSE = \frac{1}{3} ((1-y_{pred_1})^2 + (0-y_{pred_2})^2 + (0-y_{pred_3})^2) $$
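With the $y_{pred}$ column from the sketch above and the labels from the table, this is one line of NumPy:

y_true = np.array([[1], [0], [0]], dtype=np.float64)
mse = np.mean((y_true - y_pred) ** 2)  #average of the 3 squared errors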

In order to do backpropagation, we need to adjust the weights and biases (only $W_1$ is presented here; $A$ is a function that AVERAGES each derivative across ALL of the $N$ samples):

$$ W_{1_{adj}} = W_1-\alpha A\left(\frac{\partial MSE}{\partial (b_1,w_1,w_2,b_2,w_3,w_4)}\right) $$

Now I calculate the partial MSE derivatives for $b_1,w_1,w_2,b_2,w_3,w_4$. Let's calculate the partial MSE derivatives if we pretend to have only the first training sample $(x_{1}=-2, x_{2}=-1)$ as input:

$$
MSE_{x_f}=\frac{1}{1}\left(y_{true_{1}}-y_{pred_{1}}\right)^2
$$

since

$$
y_{pred_{1}}=S(o_1)=\frac{1}{1+e^{-(b_3+\frac{1}{1+e^{-(b_1+x_1w_1+x_2w_2)}}w_5+\frac{1}{1+e^{-(b_2+x_1w_3+x_2w_4)}}w_6)}}
$$

which can be written as

$$
S(o_1)=\frac{1}{1+e^{-(b_3+S(h_1)w_5+S(h_2)w_6)}}
$$

where

$$
o_1=b_3+S(h_1)w_5+S(h_2)w_6
$$

$$
S(h_1)=\frac{1}{1+e^{-(b_1+x_1w_1+x_2w_2)}}
$$

since

$$ h_1=b_1+x_1w_1+x_2w_2 $$

$$ S(h_2)=\frac{1}{1+e^{-(b_2+x_1w_3+x_2w_4)}} $$

since

$$ h_2=b_2+x_1w_3+x_2w_4 $$

MSE can also be represented as

$$
MSE_{x_f}=\left(y_{true_{1}}-S(o_1)\right)^2
$$

Using the chain rule, we want to find:

$$
\frac{\partial MSE_{x_f}}{\partial b_1}, \frac{\partial MSE_{x_f}}{\partial w_1}, \frac{\partial MSE_{x_f}}{\partial w_2}, \frac{\partial MSE_{x_f}}{\partial b_2}, \frac{\partial MSE_{x_f}}{\partial w_3}, \frac{\partial MSE_{x_f}}{\partial w_4}, \frac{\partial MSE_{x_f}}{\partial b_3}, \frac{\partial MSE_{x_f}}{\partial w_5}, \frac{\partial MSE_{x_f}}{\partial w_6}
$$

so

$$
\frac{\partial MSE_{x_f}}{\partial b_1}=\frac{\partial MSE_{x_f}}{\partial S(o_1)}\frac{\partial S(o_1)}{\partial b_1}=\frac{\partial MSE_{x_f}}{\partial S(o_1)}\frac{\partial S(o_1)}{\partial S(h_1)}\frac{\partial S(h_1)}{\partial b_1} $$

where

$$
\frac{\partial MSE_{x_f}}{\partial S(o_1)}=\frac{\partial \left(y_{true_{1}}-S(o_1)\right)^2}{\partial S(o_1)}=2(y_{true_{1}}-S(o_1))\frac{\partial (y_{true_{1}}-S(o_1))}{\partial S(o_1)}=2(y_{true_{1}}-S(o_1))(0-1)=-2(y_{true_{1}}-S(o_1)) $$

$$
\frac{\partial S(o_1)}{\partial S(h_1)}=
\frac{\partial \frac{1}{1+e^{-(o_1)}}}{\partial o_1}\frac{\partial o_1}{\partial S(h_1)}
=(1-S(o_1))S(o_1)(\frac{\partial (b_3+S(h_1)w_5+S(h_2)w_6)}{\partial S(h_1)})
=(1-S(o_1))S(o_1)w_5
$$

$$
\frac{\partial S(h_1)}{\partial b_1}=
\frac{\partial \frac{1}{1+e^{-(h_1)}}}{\partial h_1}\frac{\partial h_1}{\partial b_1}
=(1-S(h_1))S(h_1)(\frac{\partial (b_1+x_1w_1+x_2w_2)}{\partial b_1})
=(1-S(h_1))S(h_1)1
$$

and in the end we have

$$
\frac{\partial MSE_{x_f}}{\partial b_1}=
-2(y_{true_{1}}-S(o_1))(1-S(o_1))S(o_1)w_5(1-S(h_1))S(h_1)1
$$

similarly for others:

$$
\frac{\partial MSE_{x_f}}{\partial w_1}=
-2(y_{true_{1}}-S(o_1))(1-S(o_1))S(o_1)w_5(1-S(h_1))S(h_1)x_1
$$

$$
\frac{\partial MSE_{x_f}}{\partial w_2}=
-2(y_{true_{1}}-S(o_1))(1-S(o_1))S(o_1)w_5(1-S(h_1))S(h_1)x_2
$$

$$
\frac{\partial MSE_{x_f}}{\partial b_2}=
-2(y_{true_{1}}-S(o_1))(1-S(o_1))S(o_1)w_6(1-S(h_2))S(h_2)1
$$

$$
\frac{\partial MSE_{x_f}}{\partial w_3}=
-2(y_{true_{1}}-S(o_1))(1-S(o_1))S(o_1)w_6(1-S(h_2))S(h_2)x_1
$$

$$
\frac{\partial MSE_{x_f}}{\partial w_4}=
-2(y_{true_{1}}-S(o_1))(1-S(o_1))S(o_1)w_6(1-S(h_2))S(h_2)x_2
$$
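A way to convince yourself that these single-sample expressions are right (a sanity-check sketch, not part of the original post) is to compare them against a finite difference of the single-sample MSE, shown here for $\frac{\partial MSE_{x_f}}{\partial b_1}$ with arbitrary parameter values:

import numpy as np

def S(x):
  return 1/(1+np.exp(-x))

#arbitrary example values for the first sample and for all parameters
x1, x2, y_true1 = -2.0, -1.0, 1.0
b1, w1, w2 = 0.1, 0.2, 0.3
b2, w3, w4 = 0.4, 0.5, 0.6
b3, w5, w6 = 0.7, 0.8, 0.9

#single-sample MSE as a function of b1, everything else held fixed
def mse_single(b1):
  h1 = b1 + x1*w1 + x2*w2
  h2 = b2 + x1*w3 + x2*w4
  o1 = b3 + S(h1)*w5 + S(h2)*w6
  return (y_true1 - S(o1))**2

#hand-derived expression for dMSE/db1
h1 = b1 + x1*w1 + x2*w2
h2 = b2 + x1*w3 + x2*w4
o1 = b3 + S(h1)*w5 + S(h2)*w6
analytic = -2*(y_true1 - S(o1))*(1 - S(o1))*S(o1)*w5*(1 - S(h1))*S(h1)

#central finite difference in b1
eps = 1e-6
numeric = (mse_single(b1 + eps) - mse_single(b1 - eps)) / (2*eps)
print(analytic, numeric)  #the two values should agree closely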

These are the derivatives for the hidden layer, written in matrix form:

$$ \begin{bmatrix}
\frac{\partial MSE_{x_f}}{\partial b_1} & \frac{\partial MSE_{x_f}}{\partial b_2}\\
\frac{\partial MSE_{x_f}}{\partial w_1} & \frac{\partial MSE_{x_f}}{\partial w_3}\\
\frac{\partial MSE_{x_f}}{\partial w_2} & \frac{\partial MSE_{x_f}}{\partial w_4}
\end{bmatrix} $$

but the matrix that I need in order to adjust the weights in $W_1$ is (here $A$ is a function that AVERAGES each of these derivatives across ALL of the $N$ samples):

$$ \begin{bmatrix}
A(\frac{\partial MSE_{x_f}}{\partial b_1}) & A(\frac{\partial MSE_{x_f}}{\partial b_2})\\
A(\frac{\partial MSE_{x_f}}{\partial w_1}) & A(\frac{\partial MSE_{x_f}}{\partial w_3})\\
A(\frac{\partial MSE_{x_f}}{\partial w_2}) & A(\frac{\partial MSE_{x_f}}{\partial w_4})
\end{bmatrix} $$

The question is: how do I get to this matrix using only standard matrix operations (element-wise product, addition, matrix multiplication), given all the data presented above?

I need this in the end to apply it in this equation:

$$ W_{1_{adj}} = \begin{bmatrix}
b_1 & b_2\\
w_1 & w_3\\
w_2 & w_4
\end{bmatrix}-\alpha \left(\begin{bmatrix}
A(\frac{\partial MSE_{x_f}}{\partial b_1}) & A(\frac{\partial MSE_{x_f}}{\partial b_2})\\
A(\frac{\partial MSE_{x_f}}{\partial w_1}) & A(\frac{\partial MSE_{x_f}}{\partial w_3})\\
A(\frac{\partial MSE_{x_f}}{\partial w_2}) & A(\frac{\partial MSE_{x_f}}{\partial w_4})
\end{bmatrix}\right) $$

Best Answer

OK, I managed to figure it out after getting my Python code to work:

import numpy as np

#sigmoid function
def sigmoid(x):
  return 1/(1+np.exp(-x))

#derivative of sigmoid, expressed in terms of the sigmoid output s = sigmoid(x)
def sigmoid_deriv(s):
  return s * (1 - s)

#MSE function
def compute_cost(y,o):
  return np.average(np.power(y - o, 2))

#prepend ones for multiplication with biases b1 b2 b3
def prepend_ones(X):
  return np.hstack([np.ones((X.shape[0],1)), X])

#init all biases to 0 and all weights to 1
def initialize_model():
  W1 = np.array([
    [0,0], # b1 b2
    [1,1], # w1 w3
    [1,1]  # w2 w4
  ], dtype=np.float64)

  W2 = np.array([
    [0], # b3
    [1], # w5
    [1]  # w6
  ], dtype=np.float64)
  return W1, W2

def forward_prop(x, W1, W2):
  h = prepend_ones(x).dot(W1)
  h = sigmoid(h)
  o = prepend_ones(h).dot(W2)
  o = sigmoid(o)
  return o, h

def back_prop(x,y,o,h,W1,W2):
  E = y - o                           # y_true - y_pred, per sample
  dZ = -2*E*sigmoid_deriv(o)          # dMSE/do per sample (without the 1/N factor)
  Wz = W2[1:].T                       # W2 without the bias b3, i.e. [w5 w6]
  dH = dZ.dot(Wz) * sigmoid_deriv(h)  # gradient flowing back into the hidden layer
  grad_W1 = prepend_ones(x).T.dot(dH) # sums the per-sample gradients for W1
  grad_W2 = prepend_ones(h).T.dot(dZ) # sums the per-sample gradients for W2
  return grad_W1, grad_W2

def get_batch():
  data = np.array([
    [-2, -1],
    [25, 6],
    [17, 4],
    [-15, -6],
  ])
  labels = np.array([
    [1],
    [0],
    [0],
    [1],
  ])
  return data, labels

def gradient_descent(alpha=0.1):
  W1, W2 = initialize_model()
  for _ in range(0, 1000): #1000 epochs
      x,y = get_batch() #batch is always the same - all input samples
      o,h = forward_prop(x,W1,W2)
      cost = compute_cost(y,o)
      print(f"cost is {cost}")
      grad_W1, grad_W2 = back_prop(x,y,o,h,W1,W2)
      W1 = W1 - alpha*grad_W1
      W2 = W2 - alpha*grad_W2
  return W1, W2

W1,W2 = gradient_descent()
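As a quick check (not in the original answer), the trained network's predictions can be compared against the labels:

#predictions of the trained network vs. the labels
x, y = get_batch()
o, _ = forward_prop(x, W1, W2)
print(np.hstack([y, np.round(o, 3)]))  #predictions should approach 1, 0, 0, 1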

It seems that after all I can use SUM(gradients) instead of AVG(gradients) - will it converge to the same thing in the end? At least the result I got in the end is the same.
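For what it's worth, here is a minimal sketch of why the two behave so similarly: the sum is just $N$ times the average, so summing with learning rate $\alpha$ takes exactly the same step as averaging with learning rate $N\alpha$ (the numbers below are made up, purely for illustration):

#sum of gradients = N * average of gradients, so SUM with alpha
#is the same update step as AVG with N*alpha
grads = np.array([0.2, -0.4, 0.1, 0.3])  #made-up per-sample gradients
alpha, N = 0.1, len(grads)
print(alpha * grads.sum(), (alpha * N) * grads.mean())  #identical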

And this is the calculation I had to do to have everything in place (here $\odot$ denotes the element-wise product and $\times$ the matrix product):

$$
\begin{bmatrix}
\Sigma(\frac{\partial MSE_{x_f}}{\partial b_1}) & \Sigma(\frac{\partial MSE_{x_f}}{\partial b_2})\\
\Sigma(\frac{\partial MSE_{x_f}}{\partial w_1}) & \Sigma(\frac{\partial MSE_{x_f}}{\partial w_3})\\
\Sigma(\frac{\partial MSE_{x_f}}{\partial w_2}) & \Sigma(\frac{\partial MSE_{x_f}}{\partial w_4})
\end{bmatrix}
= X^T \times \left(\left(\left(-2
\begin{bmatrix}
y_{true_{1}}-S(o_1)\\
y_{true_{2}}-S(o_2)\\
y_{true_{3}}-S(o_3)
\end{bmatrix}
\odot
\begin{bmatrix}
(1-S(o_1))S(o_1)\\
(1-S(o_2))S(o_2)\\
(1-S(o_3))S(o_3)
\end{bmatrix}
\right) \times
\begin{bmatrix}
w_5 & w_6
\end{bmatrix}
\right)
\odot
\begin{bmatrix}
(1-S(h_{11}))S(h_{11}) & (1-S(h_{12}))S(h_{12})\\
(1-S(h_{21}))S(h_{21}) & (1-S(h_{22}))S(h_{22})\\
(1-S(h_{31}))S(h_{31}) & (1-S(h_{32}))S(h_{32})
\end{bmatrix}
\right)
$$
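To double-check the expression (a sketch that assumes the functions and the trained W1, W2 from the code above), it is exactly the summed-gradient matrix that back_prop returns as grad_W1; dividing by the number of samples gives the averaged version:

x, y = get_batch()
o, h = forward_prop(x, W1, W2)
grad_W1, _ = back_prop(x, y, o, h, W1, W2)

#the same quantity written out with explicit matrix operations
dZ = -2*(y - o) * (o*(1 - o))               #column of -2(y_true - S(o))(1 - S(o))S(o)
dH = dZ.dot(W2[1:].T) * (h*(1 - h))         #times [w5 w6], then element-wise (1 - S(h))S(h)
grad_W1_manual = prepend_ones(x).T.dot(dH)  #X^T times the above: summed gradients
grad_W1_avg = grad_W1_manual / x.shape[0]   #averaged gradients
print(np.allclose(grad_W1, grad_W1_manual)) #True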

The biggest problem was Wz = W2[1:].T, where I had to remove the $b_3$ element from the matrix, that is, to get $\begin{bmatrix} w_5 & w_6 \end{bmatrix}$, in order for all the operations to work out. I guess that if we had more hidden layers, we would need to remove the bias term each time we step back a layer in the back-propagation phase.

So I still have to find out what the difference between the SUM and the AVG of gradients is, and whether there is some way to avoid removing the bias terms from the matrices when going through back-prop.
