Gradient-based optimization: A small change in the input obtains a corresponding change in the output

derivatives, gradient descent, machine learning, numerical methods, real-analysis

My textbook, Deep Learning by Goodfellow, Bengio, and Courville, says the following in a section on gradient-based optimization:

The derivative of this function is denoted as $f'(x)$ or as $\dfrac{dy}{dx}$. The derivative $f'(x)$ gives the slope of $f(x)$ at the point $x$. In other words, it specifies how to scale a small change in the input to obtain the corresponding change in the output: $f(x + \epsilon) \approx f(x) + \epsilon f'(x)$.
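The approximation in the quote is easy to check numerically. The sketch below uses a hypothetical example $f(x) = x^3$ (so $f'(x) = 3x^2$) and compares the exact value $f(x+\epsilon)$ against the linear estimate $f(x) + \epsilon f'(x)$:

```python
# Numeric check of the first-order approximation f(x + eps) ≈ f(x) + eps * f'(x),
# using the hypothetical example f(x) = x**3, with f'(x) = 3*x**2.
def f(x):
    return x ** 3

def f_prime(x):
    return 3 * x ** 2

x, eps = 2.0, 1e-4
exact = f(x + eps)
approx = f(x) + eps * f_prime(x)
# The discrepancy is O(eps**2): here 3*x*eps**2 + eps**3, far smaller than eps.
print(exact, approx)
```

The gap between the two values shrinks quadratically as $\epsilon \to 0$, which is exactly what the $o(\epsilon)$ remainder in the answer below quantifies.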

The derivative is therefore useful for minimizing a function because it tells us how to change $x$ in order to make a small improvement in $y$. For example, we know that $f(x - \epsilon\,\text{sign}(f'(x)))$ is less than $f(x)$ for small enough $\epsilon$. We can thus reduce $f(x)$ by moving $x$ in small steps with the opposite sign of the derivative.
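The update rule described in the quote can be sketched in a few lines. This is a minimal illustration, not the book's code: the objective $f(x) = (x-3)^2$, the starting point, and the step size $\epsilon$ are all arbitrary choices for the example.

```python
# Descent by moving x in small steps with the opposite sign of the derivative,
# for the hypothetical objective f(x) = (x - 3)**2 with f'(x) = 2*(x - 3).
import math

def f(x):
    return (x - 3.0) ** 2

def f_prime(x):
    return 2.0 * (x - 3.0)

x, eps = 0.0, 0.01
for _ in range(1000):
    # x <- x - eps * sign(f'(x)): step against the sign of the derivative
    x -= eps * math.copysign(1.0, f_prime(x))

print(x, f(x))  # x ends up near the minimizer x = 3
```

Note the characteristic behavior of a fixed step size: once $x$ is within $\epsilon$ of the minimizer, the iterate oscillates around it rather than converging exactly, since the step length never shrinks.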

Just to be clear: the reason we can conclude that a small change in the input produces a correspondingly small change in the output is that the function is continuous at every point in its domain (as it must be, since it is assumed to be differentiable)? Without continuity, a small change in the input would not necessarily produce a small change in the output; nor would the function necessarily be differentiable at every point in its domain, since there could be discontinuities or other anomalies.

I just want to confirm that my thoughts about this are correct. Thank you.

Best Answer

For a numerical search method (finding roots or extrema) to make sense at all, you need the evaluation at one point (or at a finite set of points) to represent the values and other behavior of the function in some neighborhood. This requires continuity of some kind: semi-continuity, piecewise continuity with smooth boundaries, and so on.

In gradient descent(1), you need the function to be partially differentiable in order to compute the gradient, and continuously differentiable so that the computed gradient is actually the unique gradient value(2) and so that the Taylor formula, or Weierstraß decomposition formula, $$f(x+v)=f(x)+f'(x)v+o(\|v\|),$$ holds true. Then you can insert $v=-\epsilon\nabla f(x)$ or $v=-\epsilon\,{\rm sign}(f'(x))$ to get $$ f(x-\epsilon\nabla f)=f(x)-\epsilon f'(x)\nabla f+o(\epsilon)=f(x)-\epsilon\|\nabla f\|^2+o(\epsilon) \\~~ \text{ or } ~~ f(x-\epsilon\,{\rm sign}(f'(x)))=f(x)-\epsilon\sum_k|\partial_k f(x)|+o(\epsilon). $$ In both cases you get a definite descent as long as the remainder $o(\epsilon)$ is smaller than the terms linear in $\epsilon$.
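The first descent estimate above, $f(x-\epsilon\nabla f)=f(x)-\epsilon\|\nabla f\|^2+o(\epsilon)$, can also be checked numerically. The sketch below uses a hypothetical two-variable function $f(x,y)=x^2+3y^2$, whose gradient is $(2x,\,6y)$:

```python
# Numeric check of f(x - eps*grad f) = f(x) - eps*||grad f||**2 + o(eps),
# for the hypothetical example f(x, y) = x**2 + 3*y**2 with grad f = (2x, 6y).
def f(p):
    x, y = p
    return x ** 2 + 3 * y ** 2

def grad(p):
    x, y = p
    return (2 * x, 6 * y)

p = (1.0, -0.5)
g = grad(p)
norm_sq = g[0] ** 2 + g[1] ** 2  # ||grad f(p)||**2

eps = 1e-4
lhs = f((p[0] - eps * g[0], p[1] - eps * g[1]))  # f(x - eps * grad f)
rhs = f(p) - eps * norm_sq                        # linear descent estimate
# Since f is quadratic, lhs - rhs is exactly the O(eps**2) remainder.
print(lhs, rhs)
```

For this quadratic $f$ the remainder term is exactly of order $\epsilon^2$, so for small $\epsilon$ the linear term $-\epsilon\|\nabla f\|^2$ dominates and the step gives a definite decrease, as the answer states.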

(1) In general, to add a derivative to a point you must first convert it into a gradient relative to some scalar product: the differential $f'(x)$ is a linear functional, while the gradient $\nabla f(x)$, defined via $\langle \nabla f(x), v\rangle=f'(x)[v]$, is a proper vector (a translation) in the point space.

(2) Otherwise you need at least generalized differentiability and some control over the convex hull of the derivative set.