Gradient Descent (Geometric) – Why can the descent direction be found in the first iteration?

gradient descent, numerical optimization, optimization

There is one thing that has always confused me about gradient descent methods.

Let's assume I want to apply gradient descent for linear regression:

$$\arg\min_{\theta}\; f = \sum_i \bigl(g(x_i, y_i) - z_i\bigr)^2$$

with $g(x,y) = \theta_1 + \theta_2 x + \theta_3 y$. Now, for gradient descent the following applies:

$$\theta_{k+1} = \theta_k - \alpha \,\frac{\partial f}{\partial \theta}\bigg\rvert_{\theta = \theta_k}$$

I hope that is correct. So my question is: when I do this step in the first iteration, I can tweak my parameters as shown in the second equation.

How can I (or rather, the algorithm) know how to tweak my parameters with just one iteration? Don't I need at least two points to know how the function $z$ (which I want to approximate) behaves?

More geometrically: if I imagine a three-dimensional $(x, y, z)$ convex function and compute one point on its surface, how can I know where the steepest ascent/descent is? Why don't I need to tweak each parameter separately in each direction, e.g. $(x+1, x-1, y+1, y-1)$, to find the gradient? After all, the underlying function (which the $z$ in my error function depends on) could be a simple first-degree polynomial or a fifth-degree polynomial, and tweaking my parameters would be the same either way.

I hope someone can maximize my intuitive knowledge function 😉

Best Answer

The idea is that you do not need to do this if you have the gradient. The gradient tells you exactly which direction is "best" (i.e., the one leading to the fastest increase or decrease) around any point at which you evaluate it.
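To make this concrete with your linear regression example: the gradient can be written down in closed form, so evaluating it at the current $\theta$ requires no probing of nearby points. Writing $r_i = g(x_i, y_i) - z_i$ for the residuals,

$$\frac{\partial f}{\partial \theta_1} = 2\sum_i r_i, \qquad \frac{\partial f}{\partial \theta_2} = 2\sum_i r_i\, x_i, \qquad \frac{\partial f}{\partial \theta_3} = 2\sum_i r_i\, y_i.$$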

In fact, your "parameter tweaking" is simply the finite-difference approximation of the derivative. In other words, "tweaking" the parameters in each direction approximates the gradient anyway (equivalently, the gradient is the limit of finite differencing as the step size goes to zero). So the gradient is already doing what you think the tweaking does.
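To illustrate, here is a minimal sketch (using NumPy, with made-up toy data) comparing the closed-form gradient of your loss against the "tweak each parameter" finite-difference estimate:

```python
import numpy as np

# Hypothetical toy data: z generated by a plane plus a little noise.
rng = np.random.default_rng(0)
x, y = rng.normal(size=20), rng.normal(size=20)
z = 1.0 + 2.0 * x - 0.5 * y + 0.1 * rng.normal(size=20)

def loss(theta):
    """f = sum((g(x, y) - z)^2) with g = th1 + th2*x + th3*y."""
    r = theta[0] + theta[1] * x + theta[2] * y - z
    return np.sum(r ** 2)

def analytic_grad(theta):
    """Closed-form gradient: 2 * sum(residual * dg/dtheta_j)."""
    r = theta[0] + theta[1] * x + theta[2] * y - z
    return 2.0 * np.array([np.sum(r), np.sum(r * x), np.sum(r * y)])

def finite_diff_grad(theta, h=1e-6):
    """The 'tweak each parameter' idea: central differences per coordinate."""
    g = np.zeros_like(theta)
    for j in range(len(theta)):
        e = np.zeros_like(theta)
        e[j] = h
        g[j] = (loss(theta + e) - loss(theta - e)) / (2 * h)
    return g

theta = np.zeros(3)
print(analytic_grad(theta))     # exact gradient at theta
print(finite_diff_grad(theta))  # nearly identical as h -> 0
```

As the step size $h$ shrinks, the two agree to machine precision; the analytic gradient simply hands you that limit from a single evaluation of the formula.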

One thing you might wonder is how this can work over long distances. For instance, imagine a function with many "hills": we won't be able to "see" far enough (because the gradient only looks very close by) to find our way over the hills and valleys. This is true, and it is what people mean when they say that gradient descent methods get stuck in local minima. It may not find the global one.
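Here is a small sketch of that behaviour (the function and starting points are made up for illustration): running the same descent loop from two different starting points on a "hilly" one-dimensional function ends in two different local minima, neither of which is the global one.

```python
import numpy as np

# A one-dimensional "hilly" function: f(t) = sin(3t) + 0.1*t^2
# has several local minima (the global one is near t = -0.52).
f      = lambda t: np.sin(3 * t) + 0.1 * t ** 2
fprime = lambda t: 3 * np.cos(3 * t) + 0.2 * t

def descend(t, alpha=0.01, steps=2000):
    # Plain gradient descent: repeatedly step against the derivative.
    for _ in range(steps):
        t -= alpha * fprime(t)
    return t

# Same algorithm, different starting points -> different local minima.
print(descend(-2.0))  # settles in one valley (around t = -2.56)
print(descend( 2.0))  # settles in another  (around t =  1.53)
```

The step size $\alpha$ and the step count are arbitrary choices here; the point is only that the trajectory, and hence the minimum found, depends on where you start.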

As Wikipedia says:

> Gradient descent is a first-order iterative optimization algorithm for finding a local minimum of a differentiable function.

Note that it also only works for finding local extrema of differentiable functions. A very similar approach to the one you proposed, which works in the discrete case, is called hill climbing (sketched below).
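For completeness, a minimal sketch of that discrete idea (with a made-up objective on an integer grid; we minimize here, so strictly it is "valley descending"): probe each axis-aligned neighbour, essentially your $(x+1, x-1, y+1, y-1)$ suggestion, and move to the best one until nothing improves.

```python
# Hypothetical discrete objective on an integer grid.
def f(p):
    x, y = p
    return (x - 3) ** 2 + (y + 1) ** 2  # minimized at (3, -1)

def hill_climb(p):
    """Greedy local search: evaluate each axis-aligned neighbour
    and move to the best one until no neighbour improves."""
    while True:
        neighbors = [(p[0] + dx, p[1] + dy)
                     for dx, dy in [(1, 0), (-1, 0), (0, 1), (0, -1)]]
        best = min(neighbors, key=f)
        if f(best) >= f(p):
            return p  # local optimum on the grid
        p = best

print(hill_climb((0, 0)))  # -> (3, -1)
```

Just like gradient descent, this only uses local information around the current point, and it can likewise get trapped in a local optimum on a "hilly" objective.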