These are the gradient descent formulas:
$$\frac{\partial J}{\partial m} = \frac{2}{n}\sum -x_i\,(y_i - (mx_i + b))\\
\frac{\partial J}{\partial b} = \frac{2}{n}\sum -(y_i - (mx_i + b))$$
My understanding is that they come from first taking the positive gradient, i.e. the partial derivatives of the cost function $(y - (mx + b))^2$.
This leads to
$$\frac{\partial J}{\partial m} = 2x(y - (mx + b))\\
\frac{\partial J}{\partial b} = 2(y - (mx + b))$$
Then, to get the descent, we just negate each partial derivative, so the resulting gradients already point in the descending direction.
But when gradient descent is translated into code, this is what I often see:
    def linear_regression(X, y, m_current=0, b_current=0, epochs=1000,
                          learning_rate=0.0001):
        N = float(len(y))
        for i in range(epochs):
            y_current = (m_current * X) + b_current
            cost = sum([data**2 for data in (y - y_current)]) / N
            m_gradient = -(2/N) * sum(X * (y - y_current))
            b_gradient = -(2/N) * sum(y - y_current)
            m_current = m_current - (learning_rate * m_gradient)
            b_current = b_current - (learning_rate * b_gradient)
        return m_current, b_current, cost
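As a sanity check, the function can be exercised end to end on synthetic data drawn from a known line (a sketch assuming NumPy arrays for `X` and `y`; the function body is repeated here only so the example is self-contained):

```python
import numpy as np

# Copy of the function from the question, unchanged.
def linear_regression(X, y, m_current=0, b_current=0, epochs=1000,
                      learning_rate=0.0001):
    N = float(len(y))
    for i in range(epochs):
        y_current = (m_current * X) + b_current
        cost = sum([data**2 for data in (y - y_current)]) / N
        m_gradient = -(2/N) * sum(X * (y - y_current))
        b_gradient = -(2/N) * sum(y - y_current)
        m_current = m_current - (learning_rate * m_gradient)
        b_current = b_current - (learning_rate * b_gradient)
    return m_current, b_current, cost

X = np.arange(10, dtype=float)   # inputs 0..9
y = 3.0 * X + 1.0                # data from the line y = 3x + 1 (no noise)
m, b, cost = linear_regression(X, y, epochs=20000, learning_rate=0.01)
print(m, b, cost)                # m and b should approach 3 and 1
```

With noise-free data the recovered slope and intercept converge to the true values, and the final cost is driven toward zero.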
My question is about the updates to `m_current` and `b_current` in the final lines of the function. Why is it

    m_current = m_current - (learning_rate * m_gradient)
    b_current = b_current - (learning_rate * b_gradient)

and not

    m_current = m_current + (learning_rate * m_gradient)
    b_current = b_current + (learning_rate * b_gradient)
Our gradients are already negated to point us down the cost curve, so why aren't we updating `m_current` and `b_current` by simply adding the respective gradients?
Best Answer
If we add the gradient to the current point, we take the step of steepest ascent, that is, we locally increase the value of the objective function, but here we are trying to minimize it.
For example, suppose $y = 2x$. The gradient is $2$ everywhere, so no matter where we are, adding $2$ to our current location increases the value of $y$.
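The sign convention can also be checked numerically (a sketch, not from the original post): using the same gradient formula as the code in the question, one step of `m - lr * gradient` lowers the cost, while the proposed `m + lr * gradient` raises it.

```python
# Data from y = 2x, so the cost is minimized at m = 2.
x = [1.0, 2.0, 3.0]
y = [2.0, 4.0, 6.0]
N = len(x)

def cost(m):
    # Mean squared error for a model with slope m and no intercept.
    return sum((yi - m * xi) ** 2 for xi, yi in zip(x, y)) / N

def gradient(m):
    # dJ/dm = -(2/N) * sum(x_i * (y_i - m*x_i)), same form as m_gradient above.
    return -(2 / N) * sum(xi * (yi - m * xi) for xi, yi in zip(x, y))

m, lr = 0.0, 0.01
descent = m - lr * gradient(m)   # the standard update: subtract the gradient
ascent  = m + lr * gradient(m)   # the update the question proposes

print(cost(m), cost(descent), cost(ascent))
```

Running this shows `cost(descent) < cost(m) < cost(ascent)`: subtracting the gradient moves toward the minimum at $m = 2$, adding it moves away.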