[Math] How does one choose the step size for steepest descent

machine-learning, nonlinear-optimization, numerical-methods, numerical-optimization, optimization

Consider finding the minimal value of a function $g$ from $\mathbb{R}^n$ to $\mathbb{R}$. The method of steepest descent for finding a local minimum of such a function can be intuitively described as follows:

  1. Evaluate $g$ at an initial approximation $x^{(0)} = (x^{(0)}_{1}, \dots, x^{(0)}_{n})^t$
  2. Determine a direction from $x^{(0)}$ that results in a decrease in the value of $g$ (the greatest decrease is obtained by moving in the direction of $-\nabla g(x)$, the negative of the gradient of $g$, as in gradient descent)
  3. Move an appropriate amount in this direction and call the new value $x^{(1)}$.
  4. Repeat steps 1 through 3 with $x^{(0)}$ replaced by $x^{(1)}$.

Usually step 3 is implemented as follows (since the objective is to reduce $g(x)$ to its minimal value):

$$ x^{(1)} = x^{(0)} - \alpha \nabla g\left(x^{(0)}\right) $$

for some constant $ \alpha > 0$.
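To make the iteration concrete, here is a minimal sketch in Python with a fixed step size; the test function, its gradient `grad_g`, the starting point, and the value of `alpha` are illustrative choices, not from the question:

```python
import numpy as np

def steepest_descent_fixed_step(grad_g, x0, alpha=0.05, max_iter=1000, tol=1e-8):
    """Sketch of steepest descent with a constant step size alpha."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        d = grad_g(x)                 # gradient of g at the current iterate
        if np.linalg.norm(d) < tol:   # stop once the gradient is (nearly) zero
            break
        x = x - alpha * d             # x^(k+1) = x^(k) - alpha * grad g(x^(k))
    return x

# Illustrative quadratic g(x, y) = x^2 + 10 y^2 (an assumption, not from the post):
grad_g = lambda x: np.array([2.0 * x[0], 20.0 * x[1]])
print(steepest_descent_fixed_step(grad_g, x0=[3.0, 1.0], alpha=0.05))
```

With a constant $\alpha$ the iteration can converge slowly, or even diverge if $\alpha$ is too large, which is exactly why the choice of step size matters.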

However, to use that descent equation one has to choose an appropriate value of $\alpha$ so that $g(x^{(1)})$ is less than $g(x^{(0)})$. According to the numerical analysis textbook by Burden and Faires, to determine an appropriate choice for the value of $\alpha$ we consider the single-variable function:

$$ h(\alpha) = g\left(x^{(0)} - \alpha \nabla g\left( x^{(0)} \right)\right) $$

According to them, the value of $\alpha$ that minimizes $h$ is the value needed for the gradient descent equation.
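For concreteness, here is a sketch of that strategy in Python, where each iteration chooses $\alpha$ by numerically minimizing $h(\alpha)$; the helper names, the test function, and the search bracket $(0, 1)$ are assumptions for illustration only:

```python
import numpy as np
from scipy.optimize import minimize_scalar

def steepest_descent_exact_line_search(g, grad_g, x0, max_iter=100, tol=1e-8):
    """Sketch of steepest descent where each step size minimizes h(alpha)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        d = grad_g(x)
        if np.linalg.norm(d) < tol:
            break
        h = lambda alpha: g(x - alpha * d)   # the single-variable function from the text
        # The bracket (0, 1) is an illustrative assumption; any routine that
        # minimizes h over alpha > 0 would do.
        alpha = minimize_scalar(h, bounds=(0.0, 1.0), method="bounded").x
        x = x - alpha * d                    # take the step with the best alpha found
    return x

# Same illustrative quadratic as above:
g = lambda x: x[0] ** 2 + 10.0 * x[1] ** 2
grad_g = lambda x: np.array([2.0 * x[0], 20.0 * x[1]])
print(steepest_descent_exact_line_search(g, grad_g, x0=[3.0, 1.0]))
```

Note that $h$ depends on the current iterate, so this strategy computes a fresh step size at every iteration.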

What I don't understand is:

  1. Why is that the optimal value of $\alpha$ (i.e., the solution to the minimization of $h(\alpha)$)?
  2. What does "optimality" of $\alpha$ even mean in this case, and why does minimizing $h(\alpha)$ capture this notion of optimality? Is it the choice of $\alpha$ that gives the biggest step size possible given the computed gradient?
  3. What is the proof, or the rigorous justification, that this is the right equation to optimize in order to choose the step size? Why that equation and not something else? Is there a rigorous way to explain why that is the best $\alpha$?
  4. What type of guarantees does that choice of $\alpha$ give us? If we choose that $\alpha$, are we guaranteed that repeated iterations of steepest descent will never overshoot? Will convergence take a finite or an infinite number of iterations?
  5. Does this mean that we choose a new step size for every step of steepest descent?

Best Answer

Notice that, by definition

$$h(\alpha) \equiv g\left(x^{(0)} - \alpha \nabla g\left( x^{(0)} \right)\right) = g\left( x^{(1)} \right).$$

Since the goal is to take the step that produces the greatest decrease in $g$ along the chosen direction, this can be achieved by choosing $\alpha$ to minimize $h(\alpha)$. This is equivalent to solving

$$ \min_\alpha g\left( x^{(1)} \right),\;\;\;\text{ subject to}\;\; x^{(1)}=x^{(0)} - \alpha \nabla g\left( x^{(0)} \right).$$
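As an added illustration of why this minimization pins down $\alpha$ (not spelled out in the answer above): differentiating $h$ with the chain rule gives

$$ h'(\alpha) = -\nabla g\left(x^{(0)} - \alpha \nabla g\left(x^{(0)}\right)\right)^t \nabla g\left(x^{(0)}\right), $$

so the minimizing $\alpha$ satisfies $h'(\alpha) = 0$: the gradient at the new point $x^{(1)}$ is orthogonal to the search direction, meaning no further decrease of $g$ is possible along that direction. In the model case $g(x) = \tfrac{1}{2} x^t A x - b^t x$ with $A$ symmetric positive definite, writing $d = \nabla g\left(x^{(0)}\right) = A x^{(0)} - b$, the condition $h'(\alpha) = 0$ yields the closed form

$$ \alpha = \frac{d^t d}{d^t A d}, $$

which is the exact line-search step size for quadratics and is recomputed at every iteration.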