Gradient descent – why the partial derivative

artificial intelligence, gradient descent, machine learning

I'm quite new to AI/ML, and I was learning about gradient descent. I saw this equation that explained the gradient descent algorithm:
$$\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1)$$

I understood everything except the reason this equation uses the partial derivative of the cost function with respect to $\theta_j$. The instructor I was following said that the derivative is used to find the lowest value of the cost function $J(\theta_0, \theta_1)$.

Why is the derivative used here, and how does updating $\theta_j$ this way minimize the cost function?

Best Answer

You use a vector of partial derivatives, also known as the gradient.

In vector form the equation is

$$\begin{bmatrix}\theta_0 \\ \theta_1 \end{bmatrix} := \begin{bmatrix}\theta_0 \\ \theta_1 \end{bmatrix} - \alpha\begin{bmatrix}\frac{\partial}{\partial \theta_0} \\ \frac{\partial}{\partial \theta_1} \end{bmatrix} J(\theta_0,\theta_1) $$
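To make that vector update concrete, here is a minimal sketch in Python/NumPy. It assumes a made-up quadratic cost $J(\theta_0,\theta_1)=(\theta_0-3)^2+(\theta_1+1)^2$, whose minimiser $(3,-1)$ is known in advance; the cost, its gradient, the learning rate, and the iteration count are all illustrative choices, not part of the original question.

```python
import numpy as np

def grad_J(theta):
    """Gradient of the hypothetical cost J(theta_0, theta_1) = (theta_0 - 3)^2 + (theta_1 + 1)^2:
    the vector of partial derivatives with respect to theta_0 and theta_1."""
    return np.array([2.0 * (theta[0] - 3.0),
                     2.0 * (theta[1] + 1.0)])

alpha = 0.1                    # learning rate (illustrative value)
theta = np.array([0.0, 0.0])   # initial [theta_0, theta_1]

for _ in range(100):
    # theta := theta - alpha * gradient, i.e. the vector form of the update
    theta = theta - alpha * grad_J(theta)

print(theta)  # approaches the minimiser [3, -1]
```

Each iteration moves the whole parameter vector a small step against the gradient, so the cost decreases until the parameters settle near the minimiser.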


Path along the slope of a surface

The gradient is the direction along which the function increases fastest, so you take a step in the opposite direction, scaled by the learning rate $\alpha$ (hence the $-\alpha$ in the update).

With the descent algorithm, you take steps down the slope:

  • each coordinate is updated according to its own partial derivative,
  • and taken together, that is the same as following the direction of the (negative) gradient; the short sketch right after this list makes the equivalence concrete.
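As a quick illustration of that equivalence, again using the made-up quadratic cost from the sketch above (not anything from the original question), updating each coordinate with its own partial derivative produces exactly the same step as subtracting $\alpha$ times the gradient vector.

```python
import numpy as np

def grad_J(theta):
    # Same hypothetical cost as before: J = (theta_0 - 3)^2 + (theta_1 + 1)^2
    return np.array([2.0 * (theta[0] - 3.0),
                     2.0 * (theta[1] + 1.0)])

alpha = 0.1
theta = np.array([0.0, 0.0])

# Compute both partial derivatives first (before overwriting theta),
# then update each coordinate separately ...
g = grad_J(theta)
per_coordinate = np.array([theta[0] - alpha * g[0],
                           theta[1] - alpha * g[1]])

# ... or update the whole vector with the gradient in one step
vector_step = theta - alpha * g

print(np.allclose(per_coordinate, vector_step))  # True: identical step
```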

Below is an example image from this question. It shows how gradient descent follows a path along the slope of the function, moving down toward the minimum.

On top of it I have placed some extra arrows near the first step at the top. These arrows show how that first step can be decomposed into two components, one for each coordinate. Those components are the individual partial derivatives that appear in your equation.

[Example image: gradient descent path down the surface, with the first step decomposed into its two coordinate components]