Why does (stochastic) gradient descent use the full gradient vector (including its length) instead of just its direction?

Tags: gradient-descent, machine-learning

When implementing gradient descent (or stochastic gradient descent), I don't understand why the actual gradient vector is used rather than just the DIRECTION of the gradient.

I'm actually using the normalized direction (normalizing with the $L^\infty$ norm) multiplied by a learning rate (which is also dynamically updated during the search), and it seems to work better.
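For concreteness, here is a minimal sketch of the kind of update I mean (the toy loss, the function names, and the fixed `lr` are just for illustration; my dynamic learning-rate update is omitted):

```python
import numpy as np

def linf_normalized_step(x, grad_fn, lr):
    """One step using only the gradient's direction, normalized by the
    L-infinity norm (the largest absolute component)."""
    g = grad_fn(x)
    g_inf = np.max(np.abs(g))
    if g_inf == 0.0:           # already at a stationary point
        return x
    return x - lr * g / g_inf  # step length is controlled entirely by lr

# Toy example: minimize F(x) = ||x||^2, whose gradient is 2x.
grad_fn = lambda x: 2.0 * x
x = np.array([3.0, -1.5])
for _ in range(100):
    x = linf_normalized_step(x, grad_fn, lr=0.05)
print(x)  # ends within roughly lr of the minimizer [0, 0]
```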

What confuses me about using the gradient vector instead of just the gradient direction is that, for example, minimizing $F(x)$ and minimizing $\alpha + \beta F(x)$ (with $\beta > 0$) explore the parameter space differently, even though the second is just a scaling and shifting of the loss function and, in my opinion, that shouldn't change the way we look for the optimum location in parameter space. Ideally, optimizing $F(x)$ or $g(F(x))$, where $g$ is any smooth monotonically increasing function $\mathbb{R} \to \mathbb{R}$, should lead to an equivalent exploration.

Clearly, when $g: \mathbb{R} \to \mathbb{R}$ is a smooth increasing function, the direction of $\nabla g(F(x))$ is the same as that of $\nabla F(x)$, but the length is not:
$$
\nabla g(F(x)) = g'(F(x))\, \nabla F(x)
$$
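A quick numerical check of this scaling, using a toy quadratic loss and $g(u) = \alpha + \beta u$ (all the names here are made up for the example):

```python
import numpy as np

# An affine rescaling of the loss rescales the gradient (and hence the
# raw gradient step) but leaves its direction unchanged.
F      = lambda x: np.sum(x ** 2)          # toy loss
grad_F = lambda x: 2.0 * x
alpha, beta = 5.0, 3.0                     # g(u) = alpha + beta * u, beta > 0
grad_gF = lambda x: beta * grad_F(x)       # chain rule: g'(F(x)) * grad F(x)

x = np.array([1.0, -2.0])
g1, g2 = grad_F(x), grad_gF(x)
print(g2 / g1)                             # [3. 3.] -> length scaled by beta
print(g1 / np.linalg.norm(g1))             # same unit direction...
print(g2 / np.linalg.norm(g2))             # ...as this one
```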

What is the logical explanation for using the full gradient vector rather than just its direction?

Best Answer

It depends on the assumptions you want to place on the function you are optimizing and on the step-size schedule. For example, if the function is convex with an $L$-Lipschitz gradient (i.e., $L$-smooth), as is often assumed, then we know that:

$$f(x^+) \leq f(x) + \nabla f(x)^T\left(-t\frac{\nabla f(x)}{\|\nabla f(x)\|} \right ) + \frac{L}{2} \left\| t \frac{\nabla f(x)}{\|\nabla f(x)\|}\right\|^2$$

when we write $x^+ = x - t\,\nabla f(x)/\|\nabla f(x)\|$, as you've defined the iteration. Can you see what the optimal choice of $t$ is in this case? (Just set the derivative with respect to $t$ equal to $0$.)
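Spelling the hint out: as a function of $t$, the right-hand side is $f(x) - t\,\|\nabla f(x)\| + \frac{L}{2}t^2$, so

$$
\frac{d}{dt}\left[ f(x) - t\,\|\nabla f(x)\| + \frac{L}{2}t^2 \right] = -\|\nabla f(x)\| + Lt = 0
\quad\Longrightarrow\quad
t^\star = \frac{\|\nabla f(x)\|}{L},
$$

and therefore $x^+ = x - \frac{1}{L}\nabla f(x)$: the optimal step length along the normalized direction recovers plain gradient descent with step size $1/L$.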

Another way to think about it is that using both the gradient's direction and its norm gives you more information about the function you are optimizing than the direction alone. The norm of the gradient tells you how fast the function is changing at that point, so it makes sense to incorporate that information into your choice of step. This is similar to why using second-order information also helps with convergence (for example, in Newton's method).
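A tiny 1-D illustration of that effect (the toy function, step size, and variable names are assumptions made for this sketch): the raw gradient step shrinks automatically as the gradient norm decays near the minimum, while the direction-only step keeps a fixed length and ends up oscillating.

```python
import numpy as np

# Compare a raw gradient step with a direction-only step on f(x) = x^2 / 2,
# whose gradient is x and whose minimizer is 0.
grad = lambda x: x
lr = 0.5

x_raw, x_dir = 2.3, 2.3
for _ in range(20):
    x_raw -= lr * grad(x_raw)       # step length proportional to |grad|
    g = grad(x_dir)
    if g != 0:
        x_dir -= lr * np.sign(g)    # unit-length direction times fixed lr

print(x_raw)  # about 2.3 * 0.5**20, essentially 0
print(x_dir)  # stuck oscillating within lr of the minimizer
```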

As for your idea of using a smaller step size when the next value is too large, I am not aware of any proof that this technique converges faster than regular gradient descent, but it seems like a reasonable approach to me.
