A quaternion is just a point in $4$-space. You've got a function from, say, $[0, 2\pi]$ to $4$-space defined by
$$
\gamma: \theta \mapsto (\cos(\theta/2), 0, 0, \sin(\theta/2)).
$$
The derivative of that function is
$$
\gamma': \theta \mapsto \frac{1}{2}(-\sin(\theta/2), 0, 0, \cos(\theta/2)).
$$
Since $\gamma(\theta)$ is a unit quaternion for all $\theta$, its derivative should be a tangent vector to the unit sphere at $\gamma(\theta)$, i.e., the dot product (not quaternion product!) of $\gamma(\theta)$ and $\gamma'(\theta)$ should be zero...and in fact it is.
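If you want to check that tangency claim numerically, here is a tiny sketch (the function names are just ones I made up for the illustration):

import numpy as np

def gamma(theta):
    # the curve of unit quaternions, stored as (w, x, y, z)
    return np.array([np.cos(theta/2), 0.0, 0.0, np.sin(theta/2)])

def gamma_prime(theta):
    # componentwise derivative of gamma
    return 0.5 * np.array([-np.sin(theta/2), 0.0, 0.0, np.cos(theta/2)])

for theta in np.linspace(0.0, 2*np.pi, 5):
    # the 4-space dot product, which should be (numerically) zero
    print(np.dot(gamma(theta), gamma_prime(theta)))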
It's possible that what you want is the body-centered derivative, which is just
$\gamma(\theta)^{-1}\gamma'(\theta)$, where the "-1" denotes the quaternion inverse, and the multiplication here is quaternion multiplication; that'll get you a quaternion in the tangent space at $(1,0,0,0)$, which consists of all pure-vector quaternions...but maybe that's not what you want. If it is, it happens to be easy to compute in this case.
I'm going to just write $\gamma$ for $\gamma(\theta)$ from now on, and "c" and "s" for the cosine and sine of $\theta/2$, OK?
Because $\gamma$ is a unit quaternion, its inverse is its conjugate, so
$$
\gamma^{-1} = \bar{\gamma} = (c, 0, 0, -s).
$$
That makes the body-centered derivative be
$$
\gamma^{-1} \gamma' = (c, 0, 0, -s) \star \frac{1}{2}(-s, 0, 0, c),
$$
where I've used $\star$ to denote quaternion multiplication. The result is
\begin{align}
\gamma^{-1} \gamma' &= (c, 0, 0, -s) \star \frac{1}{2}(-s, 0, 0, c)\\
&= \frac{1}{2}\left(-sc + sc;\; c(0,0,c) - s(0,0,-s) + (0,0,-s) \times (0,0,c)\right) \\
&= \frac{1}{2}\left(0;\; (0,0,c^2 + s^2)\right) \\
&= \frac{1}{2}(0, 0, 0, 1).
\end{align}
In other words, your body-centered derivative is, at all times, $\frac{1}{2} {\mathbf k}$.
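If you'd like to verify that numerically, here is a small sketch (the helper quat_mul and the variable names are my own, nothing standard):

import numpy as np

def quat_mul(p, q):
    # Hamilton product of quaternions stored as (w, x, y, z)
    pw, pv = p[0], p[1:]
    qw, qv = q[0], q[1:]
    return np.concatenate(([pw*qw - np.dot(pv, qv)],
                           pw*qv + qw*pv + np.cross(pv, qv)))

theta = 1.234                                   # any angle gives the same answer
c, s = np.cos(theta/2), np.sin(theta/2)
g_inv = np.array([c, 0.0, 0.0, -s])             # conjugate = inverse of the unit quaternion gamma
g_dot = 0.5 * np.array([-s, 0.0, 0.0, c])       # gamma'(theta)
print(quat_mul(g_inv, g_dot))                   # [0. 0. 0. 0.5]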
OK, I managed to figure it out after getting my Python code to work:
import numpy as np

# sigmoid activation
def sigmoid(x):
    return 1/(1 + np.exp(-x))

# derivative of the sigmoid, expressed in terms of the sigmoid's output
def sigmoid_deriv(x):
    return x * (1 - x)

# MSE cost
def compute_cost(y, o):
    return np.average(np.power(y - o, 2))

# prepend a column of ones so the biases b1, b2, b3 are picked up by the matrix product
def prepend_ones(X):
    return np.hstack([np.ones((X.shape[0], 1)), X])

# init all biases to 0 and all weights to 1
def initialize_model():
    W1 = np.array([
        [0, 0],  # b1 b2
        [1, 1],  # w1 w3
        [1, 1]   # w2 w4
    ], dtype=np.float64).T
    W2 = np.array([
        [0],  # b3
        [1],  # w5
        [1]   # w6
    ], dtype=np.float64).T
    return W1.T, W2.T

def forward_prop(x, W1, W2):
    h = prepend_ones(x).dot(W1)
    h = sigmoid(h)
    o = prepend_ones(h).dot(W2)
    o = sigmoid(o)
    return o, h

def back_prop(x, y, o, h, W1, W2):
    E = y - o
    dZ = -2*E*sigmoid_deriv(o)
    Wz = W2[1:].T            # drop the bias row b3, keeping only w5, w6
    dH = dZ.dot(Wz) * sigmoid_deriv(h)
    grad_W1 = prepend_ones(x).T.dot(dH)
    grad_W2 = prepend_ones(h).T.dot(dZ)
    return grad_W1, grad_W2

def get_batch():
    data = np.array([
        [-2, -1],
        [25, 6],
        [17, 4],
        [-15, -6],
    ])
    labels = np.array([
        [1],
        [0],
        [0],
        [1],
    ])
    return data, labels

def gradient_descent(alpha=0.1):
    W1, W2 = initialize_model()
    for _ in range(1000):     # 1000 epochs
        x, y = get_batch()    # batch is always the same - all input samples
        o, h = forward_prop(x, W1, W2)
        cost = compute_cost(y, o)
        print(f"cost is {cost}")
        grad_W1, grad_W2 = back_prop(x, y, o, h, W1, W2)
        W1 = W1 - alpha*grad_W1
        W2 = W2 - alpha*grad_W2
    return W1, W2

W1, W2 = gradient_descent()
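Just as a usage note (my own addition): with the functions above you can look at what the trained weights predict by running one more forward pass:

x, y = get_batch()
o, _ = forward_prop(x, W1, W2)
print(np.round(o, 3))   # compare against the labels returned by get_batch()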
It seems that, after all this, I can use SUM(gradients) instead of AVG(gradients) and it still converges to the same thing in the end? At least the final result I got is the same.
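One way to see why that is plausible (my own remark, not part of the original derivation): with a fixed batch of $n$ samples, the averaged gradient is just the summed gradient divided by the constant $n$, so
$$
W \leftarrow W - \alpha \sum_{i=1}^{n} \nabla_W MSE_i
\quad\text{and}\quad
W \leftarrow W - (n\alpha)\,\frac{1}{n}\sum_{i=1}^{n} \nabla_W MSE_i
$$
produce exactly the same updates; SUM with learning rate $\alpha$ behaves like AVG with learning rate $n\alpha$, so only the effective step size differs.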
And this is the calculation I had to work out to put it all together:
$$ \begin{bmatrix}
\Sigma(\frac{\partial MSE_{x_f}}{\partial b_1}) & \Sigma(\frac{\partial MSE_{x_f}}{\partial b_2})\\
\Sigma(\frac{\partial MSE_{x_f}}{\partial w_1}) & \Sigma(\frac{\partial MSE_{x_f}}{\partial w_3})\\
\Sigma(\frac{\partial MSE_{x_f}}{\partial w_2}) & \Sigma(\frac{\partial MSE_{x_f}}{\partial w_4})
\end{bmatrix}
=X^T\times\left(\left(\left(-2
\begin{bmatrix}
y_{true_{1}}-S(o_1)\\
y_{true_{2}}-S(o_2)\\
y_{true_{3}}-S(o_3)\\
\end{bmatrix}
\begin{bmatrix}
(1-S(o_1))S(o_1)\\
(1-S(o_2))S(o_2)\\
(1-S(o_3))S(o_3)\\
\end{bmatrix}
\right)\times
\begin{bmatrix}
w_5 & w_6\\
\end{bmatrix}
\right)
\begin{bmatrix}
(1-S(h_{11}))S(h_{11}) & (1-S(h_{12}))S(h_{12})\\
(1-S(h_{21}))S(h_{21}) & (1-S(h_{22}))S(h_{22})\\
(1-S(h_{31}))S(h_{31}) & (1-S(h_{32}))S(h_{32})\\
\end{bmatrix}
\right)
$$
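As a quick numerical check of that expression (my own sketch, using the functions defined above and assuming $X$ in the formula means the input with a prepended column of ones, i.e. prepend_ones(x)), it reproduces grad_W1 from back_prop:

W1, W2 = initialize_model()
x, y = get_batch()
o, h = forward_prop(x, W1, W2)
dZ = -2*(y - o) * sigmoid_deriv(o)            # -2(y - S(o)) times (1 - S(o))S(o)
dH = dZ.dot(W2[1:].T) * sigmoid_deriv(h)      # times [w5 w6], times (1 - S(h))S(h)
grad_W1_manual = prepend_ones(x).T.dot(dH)
grad_W1, _ = back_prop(x, y, o, h, W1, W2)
print(np.allclose(grad_W1_manual, grad_W1))   # True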
The biggest problem was Wz = W2[1:].T, where I had to remove the $b_3$ element from the matrix - that is, to get $\begin{bmatrix}
w_5 & w_6\\
\end{bmatrix}$ so that all the operations line up. I guess that if we had more hidden layers, we would need to remove the bias term each time we step backwards through the back-propagation phase (see the sketch below).
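For what it's worth, here is a minimal sketch (my own generalization, only checked against the two-layer code above) of how that slicing could look with several hidden layers, assuming every weight matrix keeps its bias terms in row 0 as above:

def back_prop_multi(x, y, activations, Ws):
    # activations: sigmoid outputs of every layer, the last one being the network output o
    # Ws: weight matrices, with the bias terms stored in row 0 of each
    inputs = [x] + activations[:-1]     # the input each layer saw during the forward pass
    delta = -2*(y - activations[-1]) * sigmoid_deriv(activations[-1])
    grads = [None] * len(Ws)
    for i in range(len(Ws) - 1, -1, -1):
        grads[i] = prepend_ones(inputs[i]).T.dot(delta)
        if i > 0:
            # drop the bias row before propagating the error, exactly like Wz = W2[1:].T above
            delta = delta.dot(Ws[i][1:].T) * sigmoid_deriv(activations[i - 1])
    return grads

For the two-layer network above, back_prop_multi(x, y, [h, o], [W1, W2]) returns the same gradients as back_prop(x, y, o, h, W1, W2).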
So I still have to find out what the difference is between the SUM and AVG of the gradients - and whether there is some way to avoid removing the bias terms from the matrices when going through back-prop.
Best Answer
If there are actually $m$ input variables, the sum in Equation (87) of the ebook is written in the notation $$ \sum_{i=1}^m w_i^2, $$ and it can be viewed as a function of the $m$ variables $w_1, \ldots, w_m.$ The "derivative" in the ebook is a partial derivative, which describes how the function value would change if you could slightly increase or decrease just one of the $m$ input variables while leaving all the others unchanged. The notation $\frac{\partial}{\partial w}$ in the ebook means the same thing as you would recognize in $\frac{\partial}{\partial w_i},$ that is, a partial derivative with respect to one variable; the ebook has simply chosen to let the letter $w$ by itself stand for one of the $m$ variables rather than use a subscript.
The partial derivative of the sum of two functions is the sum of the partial derivatives, just as you are used to in the case of single-variable functions, but only when both partial derivatives are taken with respect to the same variable. Partial derivatives with respect to different variables do not combine in the way you imagine; and in any case, the ebook definitely means the partial derivative with respect to one variable of the entire sum.
When we write $$ \frac{\partial}{\partial w_j} w_i^2, $$ the result is zero unless $i = j,$ because in a partial derivative $\frac{\partial}{\partial w_j}$ over the variables $w_1, \ldots, w_m,$ all the variables except $w_j$ act like constants. On the other hand, $$ \frac{\partial}{\partial w_j} w_j^2 = 2w_j, $$ because that describes how the function $w_j^2$ changes as we vary $w_j.$
To spell it out in gory detail, what you actually have is \begin{align} \frac{\partial}{\partial w_j} \frac{\lambda}{2n} \sum_{i=1}^m w_i^2 &= \frac{\lambda}{2n} \frac{\partial}{\partial w_j}\left( w_1^2 + \cdots + w_{j-1}^2 + w_j^2 + w_{j+1}^2 + \cdots + w_m^2 \right) \\ & = \frac{\lambda}{2n} \left(\frac{\partial}{\partial w_j}w_1^2 + \cdots + \frac{\partial}{\partial w_j}w_{j-1}^2 + \frac{\partial}{\partial w_j}w_j^2 + \frac{\partial}{\partial w_j}w_{j+1}^2 + \cdots + \frac{\partial}{\partial w_j}w_m^2 \right) \\ & = \frac{\lambda}{2n} \left(0 + \cdots + 0 + 2w_j + 0 + \cdots + 0\right) \\ & = \frac{\lambda}{n} w_j. \end{align}
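If it helps to see that result numerically, here is a minimal finite-difference sketch (the values of $\lambda$, $n$, and the weight vector are placeholders I chose):

import numpy as np

lam, n = 0.1, 50                        # placeholder lambda and n
w = np.array([0.3, -1.2, 0.7])          # placeholder weights, m = 3

def reg(w):
    # the regularization term (lambda / 2n) * sum_i w_i^2
    return lam / (2*n) * np.sum(w**2)

j, eps = 1, 1e-6
e = np.zeros_like(w)
e[j] = eps
numeric = (reg(w + e) - reg(w - e)) / (2*eps)   # central difference in w_j
analytic = lam / n * w[j]                       # the (lambda/n) w_j derived above
print(numeric, analytic)                        # the two agree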