I'm reading this and am a bit confused starting around equation (9).
Suppose we have a real-valued cost function C of many variables, $$v = v_1, v_2, \ldots$$
Let the gradient of our cost function, C, be: $$\nabla C = \left(\frac{\partial C}{\partial v_1}, \frac{\partial C}{\partial v_2}, …\right) ^T $$
Then:
$$\Delta C \approx \nabla C \bullet \Delta v$$
Suppose we choose:
$$\Delta v = -n \nabla C,$$ where $n$ is a small, positive parameter.
Then we have:
$$\Delta C \approx -n \nabla C \bullet \nabla C = -n |\nabla C|^2$$
Since $$|\nabla C|^2 \geq 0,$$ this means that $$\Delta C \leq 0,$$ i.e. that C will always decrease, never increase, if we change $v$ as described above.
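For concreteness, this prediction can be checked numerically. The sketch below uses a hypothetical cost $C(v) = v_1^2 + 3 v_2^2$ and a step size of my own choosing (neither is from the argument itself); for a small $n$, the actual change in $C$ should nearly match $-n |\nabla C|^2$:

```python
# Sanity check: Delta C ~ -n |grad C|^2 for a small gradient step.
# The cost C(v) = v1^2 + 3*v2^2 and the step size n are illustrative choices.

def C(v):
    return v[0] ** 2 + 3 * v[1] ** 2

def grad_C(v):
    # Gradient of C: (dC/dv1, dC/dv2) = (2*v1, 6*v2)
    return [2 * v[0], 6 * v[1]]

v = [1.0, 1.0]
n = 1e-3                                        # small, positive step size
g = grad_C(v)
grad_sq = sum(gi * gi for gi in g)              # |grad C|^2
v_new = [vi - n * gi for vi, gi in zip(v, g)]   # Delta v = -n grad C
delta_C = C(v_new) - C(v)

print(delta_C, -n * grad_sq)                    # the two should nearly agree
```

With a much larger $n$ (say $n = 1$), the same comparison fails badly, which is exactly the loophole the question below is probing.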
What's wrong with the above argument? I could easily imagine a non-parabolic function where I choose changes in $v$ that lead to increases in $C$…
Best Answer
When a function is differentiable it is locally linear: the error in the linear approximation $\Delta C \approx \nabla C \bullet \Delta v$ is negligible in a sufficiently small neighborhood. That is the hidden assumption in the argument above. For a fixed $n$, the step $-n \nabla C$ may be large enough to leave that neighborhood, in which case the approximation, and with it the guarantee $\Delta C \leq 0$, no longer holds. If you take a small enough step, you stay inside the neighborhood and are indeed walking downhill on a nearly constant slope.
To find such a small-enough step, many gradient descent methods include a backtracking line search along the direction of steepest descent $-\nabla C$: one tries some step size $n$, and if it does not produce a decrease in $C$, one halves the step size until it does.
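The halving rule just described can be sketched as follows (a minimal illustration, not a production implementation; the quadratic cost, the starting step size `n0 = 1.0`, and the iteration counts are my own assumptions):

```python
# Backtracking line search sketch: try a step along -grad C,
# halve the step size until C actually decreases.
# C(v) = v1^2 + 3*v2^2 is an illustrative cost, not from the post.

def C(v):
    return v[0] ** 2 + 3 * v[1] ** 2

def grad_C(v):
    return [2 * v[0], 6 * v[1]]

def backtracking_step(v, n0=1.0, max_halvings=30):
    """Try step size n0 along -grad C; halve n until C decreases."""
    g = grad_C(v)
    n = n0
    for _ in range(max_halvings):
        v_new = [vi - n * gi for vi, gi in zip(v, g)]
        if C(v_new) < C(v):      # decrease achieved: accept this step
            return v_new, n
        n /= 2                   # step was too big: halve and retry
    return v, 0.0                # no decrease found (e.g. at a minimum)

v = [1.0, 1.0]
for _ in range(50):
    v, n = backtracking_step(v)
print(v)                         # converges toward the minimizer (0, 0)
```

Note that the initial trial step `n0 = 1.0` is deliberately too large here; each call shrinks it until the guaranteed-decrease regime from the discussion above is reached.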