These are the gradient descent formulas:
$$\frac{\partial J}{\partial m} = \frac{2}{n}\sum -x_i\,(y_i - (mx_i + b))\\
\frac{\partial J}{\partial b} = \frac{2}{n}\sum -(y_i - (mx_i + b))$$
My understanding is that they come from first taking the positive gradient, i.e. the partial derivatives of the cost function $(y - (mx + b))^2$.
This leads to
$$\frac{\partial J}{\partial m} = 2x(y - (mx + b))\\
\frac{\partial J}{\partial b} = 2(y - (mx + b))$$
Then, to get the descent, we just negate each partial derivative, so the resulting gradients already point in the descending direction.
But when gradient descent is translated into code, this is what I often see:
    def linear_regression(X, y, m_current=0, b_current=0, epochs=1000,
                          learning_rate=0.0001):
        N = float(len(y))
        for i in range(epochs):
            y_current = (m_current * X) + b_current
            cost = sum([data**2 for data in (y - y_current)]) / N
            m_gradient = -(2/N) * sum(X * (y - y_current))
            b_gradient = -(2/N) * sum(y - y_current)
            m_current = m_current - (learning_rate * m_gradient)
            b_current = b_current - (learning_rate * b_gradient)
        return m_current, b_current, cost
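As a sanity check, the function can be exercised end to end on synthetic data drawn from a known line (a sketch assuming NumPy arrays for `X` and `y`; the function body is repeated here only so the example is self-contained):

```python
import numpy as np

# Copy of the function from the question, unchanged.
def linear_regression(X, y, m_current=0, b_current=0, epochs=1000,
                      learning_rate=0.0001):
    N = float(len(y))
    for i in range(epochs):
        y_current = (m_current * X) + b_current
        cost = sum([data**2 for data in (y - y_current)]) / N
        m_gradient = -(2/N) * sum(X * (y - y_current))
        b_gradient = -(2/N) * sum(y - y_current)
        m_current = m_current - (learning_rate * m_gradient)
        b_current = b_current - (learning_rate * b_gradient)
    return m_current, b_current, cost

X = np.arange(10, dtype=float)   # inputs 0..9
y = 3.0 * X + 1.0                # data from the line y = 3x + 1 (no noise)
m, b, cost = linear_regression(X, y, epochs=20000, learning_rate=0.01)
print(m, b, cost)                # m and b should approach 3 and 1
```

With noise-free data the recovered slope and intercept converge to the true values, and the final cost is driven toward zero.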
My question is about the updates to `m_current` and `b_current` in the final lines of the function. Why is it

    m_current = m_current - (learning_rate * m_gradient)
    b_current = b_current - (learning_rate * b_gradient)

and not

    m_current = m_current + (learning_rate * m_gradient)
    b_current = b_current + (learning_rate * b_gradient)
Our gradients are already negated to point us down the cost curve, so why aren't we updating `m_current` and `b_current` by simply adding the respective gradients?
Best Answer
If we add the gradient to the current point, we take the step of steepest ascent, that is, we locally increase the value of the objective function, but here we are trying to minimize it.
For example, suppose $y = 2x$. The gradient is $2$ everywhere, so no matter where we are, adding $2$ to our current location increases the value of $y$.
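The sign convention can also be checked numerically (a sketch, not from the original post): using the same gradient formula as the code in the question, one step of `m - lr * gradient` lowers the cost, while the proposed `m + lr * gradient` raises it.

```python
# Data from y = 2x, so the cost is minimized at m = 2.
x = [1.0, 2.0, 3.0]
y = [2.0, 4.0, 6.0]
N = len(x)

def cost(m):
    # Mean squared error for a model with slope m and no intercept.
    return sum((yi - m * xi) ** 2 for xi, yi in zip(x, y)) / N

def gradient(m):
    # dJ/dm = -(2/N) * sum(x_i * (y_i - m*x_i)), same form as m_gradient above.
    return -(2 / N) * sum(xi * (yi - m * xi) for xi, yi in zip(x, y))

m, lr = 0.0, 0.01
descent = m - lr * gradient(m)   # the standard update: subtract the gradient
ascent  = m + lr * gradient(m)   # the update the question proposes

print(cost(m), cost(descent), cost(ascent))
```

Running this shows `cost(descent) < cost(m) < cost(ascent)`: subtracting the gradient moves toward the minimum at $m = 2$, adding it moves away.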