Gradient descent – why the partial derivative

artificial intelligence, gradient descent, machine learning

I'm quite new to AI/ML, and I was learning about gradient descent. I saw this equation that explained the gradient descent algorithm:
$$\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1)$$

I understood everything except the reason this equation uses the partial derivative of the cost function with respect to $\theta_j$. The instructor I was following said that the derivative is used to find the lowest value of the cost function $J(\theta_0, \theta_1)$.

Why is the derivative used here, and how does updating $\theta_j$ this way minimize the cost function?

Best Answer

You use a vector of partial derivatives, also known as the gradient.

In vector form the equation is

$$\begin{bmatrix}\theta_0 \\ \theta_1 \end{bmatrix} := \begin{bmatrix}\theta_0 \\ \theta_1 \end{bmatrix} - \alpha\begin{bmatrix}\frac{\partial}{\partial \theta_0} \\ \frac{\partial}{\partial \theta_1} \end{bmatrix} J(\theta_0,\theta_1) $$
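To make that vector update concrete, here is a minimal sketch in Python/NumPy. It assumes a made-up quadratic cost $J(\theta_0,\theta_1)=(\theta_0-3)^2+(\theta_1+1)^2$, whose minimiser $(3,-1)$ is known in advance; the cost, its gradient, the learning rate, and the iteration count are all illustrative choices, not part of the original question.

```python
import numpy as np

def grad_J(theta):
    """Gradient of the hypothetical cost J(theta_0, theta_1) = (theta_0 - 3)^2 + (theta_1 + 1)^2:
    the vector of partial derivatives with respect to theta_0 and theta_1."""
    return np.array([2.0 * (theta[0] - 3.0),
                     2.0 * (theta[1] + 1.0)])

alpha = 0.1                    # learning rate (illustrative value)
theta = np.array([0.0, 0.0])   # initial [theta_0, theta_1]

for _ in range(100):
    # theta := theta - alpha * gradient, i.e. the vector form of the update
    theta = theta - alpha * grad_J(theta)

print(theta)  # approaches the minimiser [3, -1]
```

Each iteration moves the whole parameter vector a small step against the gradient, so the cost decreases until the parameters settle near the minimiser.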


Path along the slope of a surface

The gradient is the direction along which the function increases fastest, so you take a step in the opposite direction, scaled by the learning rate $\alpha$ (hence the $-\alpha$ in the update).

With the descent algorithm, you take steps down the slope:

  • each coordinate is updated according to its own partial derivative,
  • and taken together, that is the same as following the direction of the (negative) gradient; the short sketch right after this list makes the equivalence concrete.
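As a quick illustration of that equivalence, again using the made-up quadratic cost from the sketch above (not anything from the original question), updating each coordinate with its own partial derivative produces exactly the same step as subtracting $\alpha$ times the gradient vector.

```python
import numpy as np

def grad_J(theta):
    # Same hypothetical cost as before: J = (theta_0 - 3)^2 + (theta_1 + 1)^2
    return np.array([2.0 * (theta[0] - 3.0),
                     2.0 * (theta[1] + 1.0)])

alpha = 0.1
theta = np.array([0.0, 0.0])

# Compute both partial derivatives first (before overwriting theta),
# then update each coordinate separately ...
g = grad_J(theta)
per_coordinate = np.array([theta[0] - alpha * g[0],
                           theta[1] - alpha * g[1]])

# ... or update the whole vector with the gradient in one step
vector_step = theta - alpha * g

print(np.allclose(per_coordinate, vector_step))  # True: identical step
```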

Below is an example image from this question. It shows how gradient descent follows a path along the slope of the function, moving down toward the minimum.

On top of it I have placed some extra arrows near the first step at the top. These arrows show how that first step can be decomposed into two components, one for each coordinate. Those components are the individual partial derivatives that appear in your equation.

[Example image: gradient descent path down the surface, with the first step decomposed into its two coordinate components]