A quaternion is just a point in $4$-space. You've got a function from, say, $[0, 2\pi]$ to $4$-space defined by
$$
\gamma: \theta \mapsto (\cos(\theta/2), 0, 0, \sin(\theta/2)).
$$
The derivative of that function is
$$
\gamma': \theta \mapsto \frac{1}{2}(-\sin(\theta/2), 0, 0, \cos(\theta/2)).
$$
Since $\gamma(\theta)$ is a unit quaternion for all $\theta$, its derivative should be a tangent vector to the unit sphere at $\gamma(\theta)$, i.e., the dot product (not quaternion product!) of $\gamma(\theta)$ and $\gamma'(\theta)$ should be zero...and in fact it is.
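If you want to check that tangency claim numerically, here is a tiny sketch (the function names are just ones I made up for the illustration):

import numpy as np

def gamma(theta):
    # the curve of unit quaternions, stored as (w, x, y, z)
    return np.array([np.cos(theta/2), 0.0, 0.0, np.sin(theta/2)])

def gamma_prime(theta):
    # componentwise derivative of gamma
    return 0.5 * np.array([-np.sin(theta/2), 0.0, 0.0, np.cos(theta/2)])

for theta in np.linspace(0.0, 2*np.pi, 5):
    # the 4-space dot product, which should be (numerically) zero
    print(np.dot(gamma(theta), gamma_prime(theta)))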
It's possible that what you want is the body-centered derivative, which is just
$\gamma(\theta)^{-1}\gamma'(\theta)$, where the "-1" denotes the quaternion inverse, and the multiplication here is quaternion multiplication; that'll get you a quaternion in the tangent space at $(1,0,0,0)$, which consists of all pure-vector quaternions...but maybe that's not what you want. If it is, it happens to be easy to compute in this case.
I'm going to just write $\gamma$ for $\gamma(\theta)$ from now on, and "c" and "s" for the cosine and sine of $\theta/2$, OK?
Because $\gamma$ is a unit quaternion, its inverse is its conjugate, so
$$
\gamma^{-1} = \bar{\gamma} = (c, 0, 0, -s).
$$
That makes the body-centered derivative be
$$
\gamma^{-1} \gamma' = (c, 0, 0, -s) \star \frac{1}{2}(-s, 0, 0, c),
$$
where I've used $\star$ to denote quaternion multiplication. The result is
\begin{align}
\gamma^{-1} \gamma' &= (c, 0, 0, -s) \star \frac{1}{2}(-s, 0, 0, c)\\
&= \frac{1}{2}\left(-sc + sc;\; c(0,0,c) - s(0,0,-s) + (0,0,-s) \times (0,0,c)\right) \\
&= \frac{1}{2}\left(0;\; (0,0,c^2 + s^2)\right) \\
&= \frac{1}{2}(0, 0, 0, 1).
\end{align}
In other words, your body-centered derivative is, at all times, $\frac{1}{2} {\mathbf k}$.
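If you'd like to verify that numerically, here is a small sketch (the helper quat_mul and the variable names are my own, nothing standard):

import numpy as np

def quat_mul(p, q):
    # Hamilton product of quaternions stored as (w, x, y, z)
    pw, pv = p[0], p[1:]
    qw, qv = q[0], q[1:]
    return np.concatenate(([pw*qw - np.dot(pv, qv)],
                           pw*qv + qw*pv + np.cross(pv, qv)))

theta = 1.234                                   # any angle gives the same answer
c, s = np.cos(theta/2), np.sin(theta/2)
g_inv = np.array([c, 0.0, 0.0, -s])             # conjugate = inverse of the unit quaternion gamma
g_dot = 0.5 * np.array([-s, 0.0, 0.0, c])       # gamma'(theta)
print(quat_mul(g_inv, g_dot))                   # [0. 0. 0. 0.5]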
OK, I managed to figure it out after getting my Python code to work:
import numpy as np

# sigmoid activation
def sigmoid(x):
    return 1/(1 + np.exp(-x))

# derivative of the sigmoid, expressed in terms of the sigmoid's output
def sigmoid_deriv(x):
    return x * (1 - x)

# MSE cost
def compute_cost(y, o):
    return np.average(np.power(y - o, 2))

# prepend a column of ones so the biases b1, b2, b3 are picked up by the matrix product
def prepend_ones(X):
    return np.hstack([np.ones((X.shape[0], 1)), X])

# init all biases to 0 and all weights to 1
def initialize_model():
    W1 = np.array([
        [0, 0],  # b1 b2
        [1, 1],  # w1 w3
        [1, 1]   # w2 w4
    ], dtype=np.float64).T
    W2 = np.array([
        [0],  # b3
        [1],  # w5
        [1]   # w6
    ], dtype=np.float64).T
    return W1.T, W2.T

def forward_prop(x, W1, W2):
    h = prepend_ones(x).dot(W1)
    h = sigmoid(h)
    o = prepend_ones(h).dot(W2)
    o = sigmoid(o)
    return o, h

def back_prop(x, y, o, h, W1, W2):
    E = y - o
    dZ = -2*E*sigmoid_deriv(o)
    Wz = W2[1:].T            # drop the bias row b3, keeping only w5, w6
    dH = dZ.dot(Wz) * sigmoid_deriv(h)
    grad_W1 = prepend_ones(x).T.dot(dH)
    grad_W2 = prepend_ones(h).T.dot(dZ)
    return grad_W1, grad_W2

def get_batch():
    data = np.array([
        [-2, -1],
        [25, 6],
        [17, 4],
        [-15, -6],
    ])
    labels = np.array([
        [1],
        [0],
        [0],
        [1],
    ])
    return data, labels

def gradient_descent(alpha=0.1):
    W1, W2 = initialize_model()
    for _ in range(1000):     # 1000 epochs
        x, y = get_batch()    # batch is always the same - all input samples
        o, h = forward_prop(x, W1, W2)
        cost = compute_cost(y, o)
        print(f"cost is {cost}")
        grad_W1, grad_W2 = back_prop(x, y, o, h, W1, W2)
        W1 = W1 - alpha*grad_W1
        W2 = W2 - alpha*grad_W2
    return W1, W2

W1, W2 = gradient_descent()
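Just as a usage note (my own addition): with the functions above you can look at what the trained weights predict by running one more forward pass:

x, y = get_batch()
o, _ = forward_prop(x, W1, W2)
print(np.round(o, 3))   # compare against the labels returned by get_batch()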
It seems that, after all this, I can use SUM(gradients) instead of AVG(gradients) and it still converges to the same thing in the end? At least the final result I got is the same.
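One way to see why that is plausible (my own remark, not part of the original derivation): with a fixed batch of $n$ samples, the averaged gradient is just the summed gradient divided by the constant $n$, so
$$
W \leftarrow W - \alpha \sum_{i=1}^{n} \nabla_W MSE_i
\quad\text{and}\quad
W \leftarrow W - (n\alpha)\,\frac{1}{n}\sum_{i=1}^{n} \nabla_W MSE_i
$$
produce exactly the same updates; SUM with learning rate $\alpha$ behaves like AVG with learning rate $n\alpha$, so only the effective step size differs.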
And this is the calculation I had to work out to put it all together:
$$ \begin{bmatrix}
\Sigma(\frac{\partial MSE_{x_f}}{\partial b_1}) & \Sigma(\frac{\partial MSE_{x_f}}{\partial b_2})\\
\Sigma(\frac{\partial MSE_{x_f}}{\partial w_1}) & \Sigma(\frac{\partial MSE_{x_f}}{\partial w_3})\\
\Sigma(\frac{\partial MSE_{x_f}}{\partial w_2}) & \Sigma(\frac{\partial MSE_{x_f}}{\partial w_4})
\end{bmatrix}
=X^T\times\left(\left(\left(-2
\begin{bmatrix}
y_{true_{1}}-S(o_1)\\
y_{true_{2}}-S(o_2)\\
y_{true_{3}}-S(o_3)\\
\end{bmatrix}
\begin{bmatrix}
(1-S(o_1))S(o_1)\\
(1-S(o_2))S(o_2)\\
(1-S(o_3))S(o_3)\\
\end{bmatrix}
\right)\times
\begin{bmatrix}
w_5 & w_6\\
\end{bmatrix}
\right)
\begin{bmatrix}
(1-S(h_{11}))S(h_{11}) & (1-S(h_{12}))S(h_{12})\\
(1-S(h_{21}))S(h_{21}) & (1-S(h_{22}))S(h_{22})\\
(1-S(h_{31}))S(h_{31}) & (1-S(h_{32}))S(h_{32})\\
\end{bmatrix}
\right)
$$
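As a quick numerical check of that expression (my own sketch, using the functions defined above and assuming $X$ in the formula means the input with a prepended column of ones, i.e. prepend_ones(x)), it reproduces grad_W1 from back_prop:

W1, W2 = initialize_model()
x, y = get_batch()
o, h = forward_prop(x, W1, W2)
dZ = -2*(y - o) * sigmoid_deriv(o)            # -2(y - S(o)) times (1 - S(o))S(o)
dH = dZ.dot(W2[1:].T) * sigmoid_deriv(h)      # times [w5 w6], times (1 - S(h))S(h)
grad_W1_manual = prepend_ones(x).T.dot(dH)
grad_W1, _ = back_prop(x, y, o, h, W1, W2)
print(np.allclose(grad_W1_manual, grad_W1))   # True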
The biggest problem was Wz = W2[1:].T, where I had to remove the $b_3$ element from the matrix - that is, to get $\begin{bmatrix}
w_5 & w_6\\
\end{bmatrix}$ so that all the operations line up. I guess that if we had more hidden layers, we would need to remove the bias term each time we step backwards through the back-propagation phase (see the sketch below).
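For what it's worth, here is a minimal sketch (my own generalization, only checked against the two-layer code above) of how that slicing could look with several hidden layers, assuming every weight matrix keeps its bias terms in row 0 as above:

def back_prop_multi(x, y, activations, Ws):
    # activations: sigmoid outputs of every layer, the last one being the network output o
    # Ws: weight matrices, with the bias terms stored in row 0 of each
    inputs = [x] + activations[:-1]     # the input each layer saw during the forward pass
    delta = -2*(y - activations[-1]) * sigmoid_deriv(activations[-1])
    grads = [None] * len(Ws)
    for i in range(len(Ws) - 1, -1, -1):
        grads[i] = prepend_ones(inputs[i]).T.dot(delta)
        if i > 0:
            # drop the bias row before propagating the error, exactly like Wz = W2[1:].T above
            delta = delta.dot(Ws[i][1:].T) * sigmoid_deriv(activations[i - 1])
    return grads

For the two-layer network above, back_prop_multi(x, y, [h, o], [W1, W2]) returns the same gradients as back_prop(x, y, o, h, W1, W2).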
So I still have to find out what the difference is between the SUM and AVG of the gradients - and whether there is some way to avoid removing the bias terms from the matrices when going through back-prop.
Best Answer
If there are actually $m$ input variables, the sum in Equation (87) of the ebook is written in the notation $$ \sum_{i=1}^m w_i^2, $$ and it can be viewed as a function of the $m$ variables $w_1, \ldots, w_m.$ The "derivative" in the ebook is a partial derivative, which describes how the function value would change if you could slightly increase or decrease just one of the $m$ input variables while leaving all the others unchanged. The notation $\frac{\partial}{\partial w}$ in the ebook means the same thing as you would recognize in $\frac{\partial}{\partial w_i},$ that is, a partial derivative with respect to one variable; the ebook has simply chosen to let the letter $w$ by itself stand for one of the $m$ variables rather than use a subscript.
The partial derivative of the sum of two functions is the sum of the partial derivatives, just as you are used to in the case of single-variable functions, but only when both partial derivatives are taken with respect to the same variable. Partial derivatives with respect to different variables do not combine in the way you imagine; and in any case, the ebook definitely means the partial derivative with respect to one variable of the entire sum.
When we write $$ \frac{\partial}{\partial w_j} w_i^2, $$ the result is zero unless $i = j,$ because in a partial derivative $\frac{\partial}{\partial w_j}$ over the variables $w_1, \ldots, w_m,$ all the variables except $w_j$ act like constants. On the other hand, $$ \frac{\partial}{\partial w_j} w_j^2 = 2w_j, $$ because that describes how the function $w_j^2$ changes as we vary $w_j.$
To spell it out in gory detail, what you actually have is \begin{align} \frac{\partial}{\partial w_j} \frac{\lambda}{2n} \sum_{i=1}^m w_i^2 &= \frac{\lambda}{2n} \frac{\partial}{\partial w_j}\left( w_1^2 + \cdots + w_{j-1}^2 + w_j^2 + w_{j+1}^2 + \cdots + w_m^2 \right) \\ & = \frac{\lambda}{2n} \left(\frac{\partial}{\partial w_j}w_1^2 + \cdots + \frac{\partial}{\partial w_j}w_{j-1}^2 + \frac{\partial}{\partial w_j}w_j^2 + \frac{\partial}{\partial w_j}w_{j+1}^2 + \cdots + \frac{\partial}{\partial w_j}w_m^2 \right) \\ & = \frac{\lambda}{2n} \left(0 + \cdots + 0 + 2w_j + 0 + \cdots + 0\right) \\ & = \frac{\lambda}{n} w_j. \end{align}
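If it helps to see that result numerically, here is a minimal finite-difference sketch (the values of $\lambda$, $n$, and the weight vector are placeholders I chose):

import numpy as np

lam, n = 0.1, 50                        # placeholder lambda and n
w = np.array([0.3, -1.2, 0.7])          # placeholder weights, m = 3

def reg(w):
    # the regularization term (lambda / 2n) * sum_i w_i^2
    return lam / (2*n) * np.sum(w**2)

j, eps = 1, 1e-6
e = np.zeros_like(w)
e[j] = eps
numeric = (reg(w + e) - reg(w - e)) / (2*eps)   # central difference in w_j
analytic = lam / n * w[j]                       # the (lambda/n) w_j derived above
print(numeric, analytic)                        # the two agree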