[Math] Gradient of a function as the direction of steepest ascent/descent

Tags: gradient descent, linear algebra

I am trying to really understand why the gradient of a function gives the direction of steepest ascent intuitively.

Assuming that the function is differentiable at the point in question,
a) I had a look at a few resources online and also at this popular question on this stackexchange site: Why is gradient the direction of steepest ascent?
The accepted answer basically says that we multiply the gradient with an arbitrary vector and that the product is maximized when the vector points in the same direction as the gradient. To me this really does not answer the question, but it has 31 upvotes, so can someone please point out what I am obviously missing?

b) Does the gradient of a function tell us a way to reach the maxima or minima? If yes, then how, and which one: the maxima, the minima, or both?
Edit: I read about the gradient descent algorithm, and that answers this part of my question.
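
For concreteness, here is a minimal gradient-descent sketch in Python (the example function $f(x,y) = x^2 + y^2$, the step size, and the starting point are my own illustrative choices, not part of any particular reference):

```python
# Minimal gradient-descent sketch: repeatedly step *against* the gradient
# to decrease f; stepping *with* the gradient would increase it instead.
def grad(x, y):
    return (2 * x, 2 * y)  # gradient of the example f(x, y) = x^2 + y^2

x, y, lr = 3.0, -4.0, 0.1   # starting point and step size (arbitrary choices)
for _ in range(100):
    gx, gy = grad(x, y)
    x, y = x - lr * gx, y - lr * gy  # minus sign: descend toward a minimum
print(x, y)  # ends up near (0, 0), the minimum of this example function
```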

c) Since the gradient is a feature of the function at some particular point, am I right in assuming that it can only point toward a local maximum or minimum?

Best Answer

The question is how you would measure the steepness of ascent. For one-dimensional functions, steepness is defined in terms of the derivative:

$$f^\prime(x) \equiv \lim_{h \rightarrow 0}\frac{f(x+h)-f(x)}{h}$$

By this limit definition, steepness is measured by computing the slope between the points $\langle x, f(x)\rangle$ and $\langle x + h, f(x+h)\rangle$, and letting the horizontal offset $h$ get smaller and smaller.
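
As a quick numerical illustration (a sketch of my own, with $f(x) = x^2$ chosen purely as an example), the difference quotient settles toward the derivative as $h$ shrinks:

```python
# The difference quotient (f(x + h) - f(x)) / h approaches f'(x) as h shrinks.
def f(x):
    return x ** 2  # example function; its derivative is f'(x) = 2x

x = 3.0
for h in [1.0, 0.1, 0.01, 0.001]:
    slope = (f(x + h) - f(x)) / h
    print(f"h = {h:7.3f}   slope = {slope:.4f}")  # tends toward f'(3) = 6
```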


Now the question is how we extend this idea of steepness to functions of more than one variable.

Trick #1: Directional steepness requires only ordinary derivatives

Suppose we have a two-variable function $f(x,y)$. (Conceptually, the graph of $f$ is a surface hovering above the $xy$ plane.) Because we are presumably just learning multivariable calculus, we don't have a mathematical definition for the "steepness" at a point $\langle x,y\rangle$. However, there is a trick:

Suppose you pick a point $\langle x_0, y_0\rangle$. And you also pick a direction, in the form of a line like $2y = 3x$. You can see how the height of the function $f$ varies as you start at the point $\langle x_0, y_0 \rangle$ and take small steps in the direction of the line. You can compute this directional steepness using only the ordinary (one-dimensional) derivative.

In fact the equation is something like this:

$$D_{2y=3x} f = \lim_{h\rightarrow 0}\frac{f(x_0 + 2h, y_0 + 3h) - f(x_0, y_0)}{h}$$

(Advanced side note: this definition really is just a one-dimensional derivative. If I parameterize the line through $\langle x_0, y_0\rangle$ in the direction of $2y=3x$ by $u(t) = \langle x_0 + 2t,\, y_0 + 3t\rangle$, I can define the directional derivative as just $$D_u f \equiv D(f\circ u)(0).$$ To put it in more standard notation, $D_u f \equiv \left[\frac{d}{dt}f(u(t))\right]_{t=0}$.)
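
To make this concrete, here is a small numerical sketch of that limit (the example surface $f(x,y) = x^2 + y^2$ and the base point are my own choices, not part of the original argument):

```python
# Directional steepness along the line 2y = 3x: step by (2h, 3h) and form
# the difference quotient from the definition above.
def f(x, y):
    return x ** 2 + y ** 2  # example surface

x0, y0 = 1.0, 2.0  # example base point
for h in [0.1, 0.01, 0.001]:
    d = (f(x0 + 2 * h, y0 + 3 * h) - f(x0, y0)) / h
    print(f"h = {h:6.3f}   quotient = {d:.4f}")
# For this example the quotients approach 2*x0*2 + 2*y0*3 = 16.
```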

Trick #2: The gradient is a list of the steepness in each axis direction

In the previous section, we defined how to compute the directional steepness of a function, that is, the steepness in the direction of a line.

The lines along the coordinate axes are especially important. If we have a multivariable function $f(x_1, x_2, x_3, \ldots, x_n)$, let $\ell_1, \ell_2, \ldots \ell_n$ be lines, where $\ell_i$ is the line lying along the $x_i$ axis.

We'll define the gradient to be the list of directional steepnesses in each of the coordinate directions:

$$\nabla f = \langle D_{\ell_1}f, D_{\ell_2}f, \ldots, D_{\ell_n}f\rangle.$$

Let's think carefully about this structure. The function $f$ takes in a list of numbers $x_1,\ldots, x_n$ and produces a single number. The function $\nabla f$ takes in a list of $n$ numbers and produces a list of $n$ steepnesses (which are also numbers).

Visually, you can imagine that $\nabla f$ takes in a point $\langle x_1, \ldots, x_n\rangle$ and produces a steepness vector at that point. The components of that vector are made up of the directional steepnesses of the function $f$ in the direction of the coordinate axes.
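
Here is a sketch of this structure (using the same example surface as above; the central-difference step $h$ is an arbitrary small number):

```python
# Approximate the gradient as the list of steepnesses along each coordinate axis.
def f(x, y):
    return x ** 2 + y ** 2  # same example surface as before

def grad(f, x, y, h=1e-6):
    dfdx = (f(x + h, y) - f(x - h, y)) / (2 * h)  # steepness along the x-axis
    dfdy = (f(x, y + h) - f(x, y - h)) / (2 * h)  # steepness along the y-axis
    return (dfdx, dfdy)

print(grad(f, 1.0, 2.0))  # roughly (2.0, 4.0), i.e. (2*x0, 2*y0) for this example
```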

Trick #3: Dot products measure directional overlap

When $\vec{u}$ and $\vec{v}$ are vectors, then the dot product between $\vec{u}$ and $\vec{v}$ can be defined by

$$\vec{u}\cdot \vec{v} = ||\vec{u}|| \cdot ||\vec{v} || \cdot \cos{\theta},$$

where $\theta$ is the angle between the two vectors.

Now suppose $\vec{v}$ is kept constant. If we keep the length of $\vec{u}$ constant but allow it to revolve in a circle, for example, we can change the angle $\theta$ and see how it affects the dot product.

Evidently, the dot product is maximized when the two vectors are pointing in the same direction, because then $\cos{\theta}=\cos{0} = 1$ is maximal.
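
A quick sketch of that behaviour (the fixed vector $\vec{v} = \langle 3, 4\rangle$ is an arbitrary example):

```python
# Keep |u| and |v| fixed and rotate u: the dot product u.v = |u||v|cos(theta)
# is largest when the angle between the two vectors is zero.
import math

v = (3.0, 4.0)                 # fixed example vector, |v| = 5
base = math.atan2(v[1], v[0])  # direction of v
for deg in [0, 45, 90, 135, 180]:
    theta = math.radians(deg)
    u = (math.cos(base + theta), math.sin(base + theta))  # unit vector at angle theta from v
    dot = u[0] * v[0] + u[1] * v[1]
    print(f"theta = {deg:3d} deg   u.v = {dot:7.3f}")  # maximum (= |v|) at theta = 0
```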

Trick #4: You can compute directional steepness using the dot product

Recall that $D_u f$ is the steepness of $f$ in the direction of the line parameterized by $u$, and that $\nabla f$ is the gradient of $f$: the list of the directional steepnesses in each of the coordinate directions.

It turns out that the following fact is true:

If $u(t) = \langle x_0 + at,\, y_0 + bt\rangle$ parameterizes a line through the point $\langle x_0, y_0\rangle$, and if the direction vector $\langle a, b\rangle$ has length 1, then $$D_u f = \nabla f \cdot \langle a, b\rangle.$$ In other words, we can compute the directional steepness as the dot product of the gradient with the unit direction vector of the line.
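
Here is a small numerical check of that fact (same example surface and base point as above; the direction $\langle 2, 3\rangle$ is normalized to unit length first):

```python
# Check: the directional steepness (difference quotient with a tiny step)
# matches the dot product of the gradient with the unit direction vector.
import math

def f(x, y):
    return x ** 2 + y ** 2  # example surface

x0, y0 = 1.0, 2.0
a, b = 2.0, 3.0
norm = math.hypot(a, b)
a, b = a / norm, b / norm          # make <a, b> a unit vector

h = 1e-6
directional = (f(x0 + a * h, y0 + b * h) - f(x0, y0)) / h
gradient = (2 * x0, 2 * y0)        # exact gradient of this example at (x0, y0)
dot = gradient[0] * a + gradient[1] * b
print(directional, dot)            # both roughly 16 / sqrt(13), about 4.4376
```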

Conclusion: The gradient is the direction of steepest ascent

Because we can compute directional steepness as a dot product with the gradient, the answer to the question "In which direction is this function steepest?" is the same as the answer to the question "Which unit direction has the greatest dot product with the gradient?", which we know (from Trick #3) is the direction parallel to the gradient!
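
And a final sanity check of the conclusion (again a sketch of my own, sweeping unit directions one degree at a time for the example gradient $\langle 2, 4\rangle$):

```python
# Among all unit directions, the dot product with the gradient (and hence the
# directional steepness) is largest in the direction of the gradient itself.
import math

gradient = (2.0, 4.0)  # gradient of the example f(x, y) = x^2 + y^2 at (1, 2)
best_deg, best_dot = None, -float("inf")
for deg in range(360):
    theta = math.radians(deg)
    u = (math.cos(theta), math.sin(theta))
    dot = gradient[0] * u[0] + gradient[1] * u[1]
    if dot > best_dot:
        best_deg, best_dot = deg, dot
print(best_deg, math.degrees(math.atan2(4.0, 2.0)))  # about 63 and 63.43: they agree
```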