I'm reading this and am a bit confused starting around equation (9).
Suppose we have a real-valued cost function C of many variables, $$v = v_1, v_2, \ldots$$
Let the gradient of our cost function, C, be: $$\nabla C = \left(\frac{\partial C}{\partial v_1}, \frac{\partial C}{\partial v_2}, …\right) ^T $$
Then:
$$\Delta C \approx \nabla C \bullet \Delta v$$
Suppose we choose:
$$\Delta v = -n \nabla C,$$ where $n$ is a small, positive parameter.
Then we have:
$$\Delta C \approx -n \nabla C \bullet \nabla C = -n |\nabla C|^2$$
Since $$|\nabla C|^2 \geq 0,$$ this means that $$\Delta C \leq 0,$$ i.e. that C will always decrease, never increase, if we change $v$ as described above.
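For concreteness, this prediction can be checked numerically. The sketch below uses a hypothetical cost $C(v) = v_1^2 + 3 v_2^2$ and a step size of my own choosing (neither is from the argument itself); for a small $n$, the actual change in $C$ should nearly match $-n |\nabla C|^2$:

```python
# Sanity check: Delta C ~ -n |grad C|^2 for a small gradient step.
# The cost C(v) = v1^2 + 3*v2^2 and the step size n are illustrative choices.

def C(v):
    return v[0] ** 2 + 3 * v[1] ** 2

def grad_C(v):
    # Gradient of C: (dC/dv1, dC/dv2) = (2*v1, 6*v2)
    return [2 * v[0], 6 * v[1]]

v = [1.0, 1.0]
n = 1e-3                                        # small, positive step size
g = grad_C(v)
grad_sq = sum(gi * gi for gi in g)              # |grad C|^2
v_new = [vi - n * gi for vi, gi in zip(v, g)]   # Delta v = -n grad C
delta_C = C(v_new) - C(v)

print(delta_C, -n * grad_sq)                    # the two should nearly agree
```

With a much larger $n$ (say $n = 1$), the same comparison fails badly, which is exactly the loophole the question below is probing.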
What's wrong with the above argument? I could easily imagine a non-parabolic function where I choose changes in $v$ that lead to increases in $C$…
Best Answer
When a function is differentiable it is locally linear: the error in the linear approximation $\Delta C \approx \nabla C \bullet \Delta v$ is negligible in a sufficiently small neighborhood. That is the hidden assumption in the argument above. For a fixed $n$, the step $-n \nabla C$ may be large enough to leave that neighborhood, in which case the approximation, and with it the guarantee $\Delta C \leq 0$, no longer holds. If you take a small enough step, you stay inside the neighborhood and are indeed walking downhill on a nearly constant slope.
To find such a small-enough step, many gradient descent methods include a backtracking line search along the direction of steepest descent $-\nabla C$: one tries some step size $n$, and if it does not produce a decrease in $C$, one halves the step size until it does.
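The halving rule just described can be sketched as follows (a minimal illustration, not a production implementation; the quadratic cost, the starting step size `n0 = 1.0`, and the iteration counts are my own assumptions):

```python
# Backtracking line search sketch: try a step along -grad C,
# halve the step size until C actually decreases.
# C(v) = v1^2 + 3*v2^2 is an illustrative cost, not from the post.

def C(v):
    return v[0] ** 2 + 3 * v[1] ** 2

def grad_C(v):
    return [2 * v[0], 6 * v[1]]

def backtracking_step(v, n0=1.0, max_halvings=30):
    """Try step size n0 along -grad C; halve n until C decreases."""
    g = grad_C(v)
    n = n0
    for _ in range(max_halvings):
        v_new = [vi - n * gi for vi, gi in zip(v, g)]
        if C(v_new) < C(v):      # decrease achieved: accept this step
            return v_new, n
        n /= 2                   # step was too big: halve and retry
    return v, 0.0                # no decrease found (e.g. at a minimum)

v = [1.0, 1.0]
for _ in range(50):
    v, n = backtracking_step(v)
print(v)                         # converges toward the minimizer (0, 0)
```

Note that the initial trial step `n0 = 1.0` is deliberately too large here; each call shrinks it until the guaranteed-decrease regime from the discussion above is reached.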