Rather than the chain rule, let's tackle the problem using differentials.
Let's use the convention that an uppercase letter is a matrix, a lowercase letter is a column vector, and a Greek letter is a scalar. Now let's define some variables and their differentials:
$$\eqalign{
z &= W_1x+b_1 &\implies dz=dW_1\,x \cr
h &= {\rm relu}(z) &\implies dh={\rm step}(z)\odot dz = s\odot dz \cr
}$$ where ${\rm step}(z)$ is the Heaviside step function, and only $W_1$ is being varied (so $x$ and $b_1$ are treated as constants). Both the relu and step functions are applied elementwise to their arguments.
Now take the differential and gradient of the function.
$$\eqalign{
\phi &= w_2^Th + \beta_2 \cr
&= w_2:h + \beta_2 \cr
\cr
d\phi
&= w_2:dh \cr
&= w_2:(s\odot dz) \cr
&= (s\odot w_2):dz \cr
&= (s\odot w_2):dW_1\,x \cr
&= (s\odot w_2)x^T:dW_1 \cr
\cr
\frac{\partial\phi}{\partial W_1}
&= (w_2\odot s)x^T \cr
&= \Big(w_2\odot{\rm step}(W_1x+b_1)\Big)x^T \cr\cr
}$$
In the above, I used the notations
$$\eqalign{
&A:B = {\rm tr}(A^TB) \cr
&A\odot B \cr
}$$
for the trace/Frobenius and elementwise/Hadamard products, respectively.
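As a quick numerical sanity check of this result (my own sketch, not part of the derivation; the shapes and values are arbitrary), a finite-difference approximation matches the formula $(w_2\odot s)x^T$:

import numpy as np

# finite-difference check of d(phi)/dW1 = (w2 ⊙ step(W1 x + b1)) x^T
rng = np.random.default_rng(0)
W1 = rng.normal(size=(3, 2))
b1 = rng.normal(size=3)
w2 = rng.normal(size=3)
beta2 = rng.normal()
x = rng.normal(size=2)

def phi(W):
    h = np.maximum(W @ x + b1, 0)        # relu, elementwise
    return w2 @ h + beta2

s = (W1 @ x + b1 > 0).astype(float)      # step(z), elementwise
analytic = np.outer(w2 * s, x)           # (w2 ⊙ s) x^T

eps = 1e-6
numeric = np.zeros_like(W1)
for i in range(3):
    for j in range(2):
        dW = np.zeros_like(W1)
        dW[i, j] = eps
        numeric[i, j] = (phi(W1 + dW) - phi(W1 - dW)) / (2 * eps)

print(np.allclose(analytic, numeric))    # True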
OK, I managed to figure it out after getting my Python code to work:
import numpy as np

# sigmoid function
def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# derivative of the sigmoid, expressed in terms of the sigmoid's output
def sigmoid_deriv(x):
    return x * (1 - x)

# MSE function
def compute_cost(y, o):
    return np.average(np.power(y - o, 2))

# prepend ones for multiplication with the biases b1 b2 b3
def prepend_ones(X):
    return np.hstack([np.ones((X.shape[0], 1)), X])

# init all biases to 0 and all weights to 1
def initialize_model():
    W1 = np.array([
        [0, 0],  # b1 b2
        [1, 1],  # w1 w3
        [1, 1],  # w2 w4
    ], dtype=np.float64)
    W2 = np.array([
        [0],  # b3
        [1],  # w5
        [1],  # w6
    ], dtype=np.float64)
    return W1, W2

def forward_prop(x, W1, W2):
    h = prepend_ones(x).dot(W1)
    h = sigmoid(h)
    o = prepend_ones(h).dot(W2)
    o = sigmoid(o)
    return o, h

def back_prop(x, y, o, h, W1, W2):
    E = y - o
    dZ = -2 * E * sigmoid_deriv(o)
    Wz = W2[1:].T  # drop the bias row b3 so only w5, w6 propagate back
    dH = dZ.dot(Wz) * sigmoid_deriv(h)
    grad_W1 = prepend_ones(x).T.dot(dH)
    grad_W2 = prepend_ones(h).T.dot(dZ)
    return grad_W1, grad_W2

def get_batch():
    data = np.array([
        [-2, -1],
        [25, 6],
        [17, 4],
        [-15, -6],
    ])
    labels = np.array([
        [1],
        [0],
        [0],
        [1],
    ])
    return data, labels

def gradient_descent(alpha=0.1):
    W1, W2 = initialize_model()
    for _ in range(1000):  # 1000 epochs
        x, y = get_batch()  # the batch is always the same - all input samples
        o, h = forward_prop(x, W1, W2)
        cost = compute_cost(y, o)
        print(f"cost is {cost}")
        grad_W1, grad_W2 = back_prop(x, y, o, h, W1, W2)
        W1 = W1 - alpha * grad_W1
        W2 = W2 - alpha * grad_W2
    return W1, W2

W1, W2 = gradient_descent()
It seems that after all this I can use the SUM of the gradients rather than the AVG - it will converge to the same thing in the end? At least the final result I got is the same.
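To see what the code is actually computing (a quick sketch reusing the functions above): back_prop returns the SUM of the per-sample gradients, and the AVG is just that divided by the batch size $n$ - a constant factor that the learning rate can absorb.

# verify that back_prop sums the per-sample gradients rather than averaging
W1, W2 = initialize_model()
x, y = get_batch()
o, h = forward_prop(x, W1, W2)
batch_grad_W1, _ = back_prop(x, y, o, h, W1, W2)

per_sample = [back_prop(x[i:i+1], y[i:i+1], o[i:i+1], h[i:i+1], W1, W2)[0]
              for i in range(x.shape[0])]
print(np.allclose(batch_grad_W1, np.sum(per_sample, axis=0)))  # True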
And this is the calculation I had to do to put everything in place:
$$ \begin{bmatrix}
\Sigma(\frac{\partial MSE_{x_f}}{\partial b_1}) & \Sigma(\frac{\partial MSE_{x_f}}{\partial b_2})\\
\Sigma(\frac{\partial MSE_{x_f}}{\partial w_1}) & \Sigma(\frac{\partial MSE_{x_f}}{\partial w_3})\\
\Sigma(\frac{\partial MSE_{x_f}}{\partial w_2}) & \Sigma(\frac{\partial MSE_{x_f}}{\partial w_4})
\end{bmatrix}
=X^T\times\left(\left(\left(-2
\begin{bmatrix}
y_{true_{1}}-S(o_1)\\
y_{true_{2}}-S(o_2)\\
y_{true_{3}}-S(o_3)\\
\end{bmatrix}
\odot
\begin{bmatrix}
(1-S(o_1))S(o_1)\\
(1-S(o_2))S(o_2)\\
(1-S(o_3))S(o_3)\\
\end{bmatrix}
\right)\times
\begin{bmatrix}
w_5 & w_6\\
\end{bmatrix}
\right)
\odot
\begin{bmatrix}
(1-S(h_{11}))S(h_{11}) & (1-S(h_{12}))S(h_{12})\\
(1-S(h_{21}))S(h_{21}) & (1-S(h_{22}))S(h_{22})\\
(1-S(h_{31}))S(h_{31}) & (1-S(h_{32}))S(h_{32})\\
\end{bmatrix}
\right)
$$
Here $S$ is the sigmoid, $\odot$ is the elementwise product, and $X$ already has the column of ones prepended (which is what produces the $b_1$, $b_2$ row).
The biggest problem was Wz = W2[1:].T, where I had to remove the $b_3$ element from the matrix - that is, to get $\begin{bmatrix}
w_5 & w_6\\
\end{bmatrix}$ - in order for all the operations to work out. I guess that with more hidden layers we would need to remove the bias term each time we step back a layer during the back-propagation phase.
So I still have to find out what the difference between the SUM and the AVG of the gradients is - and is there some way to avoid removing the bias terms from the matrices during back-prop? (One possibility is sketched below.)
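On that second question: one common way to avoid the slicing altogether (this is a sketch of that convention, not part of the code above) is to store the weights and biases in separate arrays; then the backward pass multiplies by plain $W^T$ and the bias gradient is simply the column sum of the deltas.

# sketch: with separate W and b, no bias row ever has to be removed
def forward(x, W1, b1, W2, b2):
    h = sigmoid(x.dot(W1) + b1)
    o = sigmoid(h.dot(W2) + b2)
    return o, h

def backward(x, y, o, h, W2):
    dZ = -2 * (y - o) * sigmoid_deriv(o)
    dH = dZ.dot(W2.T) * sigmoid_deriv(h)   # plain W2.T - nothing to slice off
    grad_W1, grad_b1 = x.T.dot(dH), dH.sum(axis=0)
    grad_W2, grad_b2 = h.T.dot(dZ), dZ.sum(axis=0)
    return grad_W1, grad_b1, grad_W2, grad_b2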
Best Answer
It's pretty quick to use the chain rule here. Note that $$ \frac{\partial }{\partial x} \|x\|^2 = 2 x^T. $$ With that, $$ \frac{\partial }{\partial b_1}\|x - (W_2(W_1x+b_1)+b_2)\|^2 \\= 2[x - (W_2(W_1x+b_1)+b_2)]^T \frac{\partial }{\partial b_1} [x - (W_2(W_1x+b_1)+b_2)] \\ = 2[x - (W_2(W_1x+b_1)+b_2)]^T (-W_2). $$
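This identity is easy to verify numerically (my own sketch; the shapes below are arbitrary):

import numpy as np

# finite-difference check of d/db1 ||x - (W2 (W1 x + b1) + b2)||^2 = -2 r^T W2
rng = np.random.default_rng(1)
W1 = rng.normal(size=(3, 4))
b1 = rng.normal(size=3)
W2 = rng.normal(size=(4, 3))
b2 = rng.normal(size=4)
x = rng.normal(size=4)

def f(b):
    r = x - (W2 @ (W1 @ x + b) + b2)   # residual
    return r @ r                       # squared norm

r = x - (W2 @ (W1 @ x + b1) + b2)
analytic = -2 * r @ W2                 # the row vector -2 r^T W2

eps = 1e-6
numeric = np.array([(f(b1 + eps * e) - f(b1 - eps * e)) / (2 * eps)
                    for e in np.eye(3)])
print(np.allclose(analytic, numeric))  # True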