Solved – Difference in using normalized gradient and gradient

algorithms, optimization

In the general setting of the gradient descent algorithm, we have $x_{n+1} = x_n - \eta \, \nabla f(x_n)$, where $x_n$ is the current point, $\eta$ is the step size and $\nabla f(x_n)$ is the gradient of the objective $f$ evaluated at $x_n$.
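
For concreteness, here is a minimal sketch of this update rule, assuming a toy one-dimensional objective $f(x) = x^2$; the objective and the step size are illustrative choices only, not part of the question:

```python
# Sketch of plain gradient descent on f(x) = x^2, whose gradient is 2x.
def gradient(x):
    return 2.0 * x

x = 5.0        # starting point x_0
eta = 0.1      # step size
for _ in range(50):
    x = x - eta * gradient(x)    # x_{n+1} = x_n - eta * gradient at x_n

print(x)       # approaches the minimizer at x = 0
```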

I have seen that in some algorithms people use the normalized gradient instead of the gradient. I wanted to know what the difference is between using the normalized gradient and simply the gradient.

Best Answer

In a gradient descent algorithm, the algorithm proceeds by finding a direction along which it can improve the objective, and the direction of steepest descent turns out to be the negative gradient. However, since we are only interested in the direction, and not necessarily in how far we move along it, we are usually not interested in the magnitude of the gradient. The normalized gradient is therefore good enough for our purposes, and we let $\eta$ dictate how far we move in the computed direction. If you use unnormalized gradient descent, on the other hand, the distance you move in that direction at any point is dictated by the magnitude of the gradient, i.e., in essence by the shape of the objective function's surface: a point on a steep part of the surface has a gradient of large magnitude, whereas a point on a fairly flat part has a gradient of small magnitude.
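
As a small sketch of this point (the quadratic objective and the step size below are my own illustrative choices), the unnormalized step length scales with the gradient magnitude, while the normalized step length is always exactly $\eta$:

```python
import numpy as np

# Hypothetical objective f(x) = 0.5 * x^T A x, chosen only because its surface is
# steep along one axis and nearly flat along the other.
A = np.diag([100.0, 0.01])
grad = lambda x: A @ x

eta = 0.1
for x in (np.array([1.0, 0.0]),       # point on the steep part of the surface
          np.array([0.0, 1.0])):      # point on the nearly flat part
    g = grad(x)
    print("unnormalized step length:", eta * np.linalg.norm(g),   # depends on |gradient|
          "normalized step length:", eta)                         # always exactly eta
```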

From the above, you might have realized that normalizing the gradient gives you added control (whether that is useful or not depends on your specific application). What I mean by the above is:
1] If you want to ensure that your algorithm moves a fixed step length in every iteration, then you might want to use normalized gradient descent with a fixed $\eta$.
2] If you want the step length to be dictated precisely by you, then again you may want to use normalized gradient descent, with your specific step-size schedule encoded into $\eta$.
3] If you want the magnitude of the gradient to dictate the step length, then you will use unnormalized gradient descent. There are several other variants as well: for example, you can let the magnitude of the gradient decide the step length but put a cap on it, and so on (a short sketch of these variants follows this list).
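
Here is a sketch of the three options plus the capped variant, using a small helper `gd_step` of my own design (not taken from any particular library):

```python
import numpy as np

def gd_step(x, grad_fn, eta, mode="unnormalized", cap=None):
    """One descent step; `mode` selects among the variants 1]-3] listed above."""
    g = grad_fn(x)
    norm = np.linalg.norm(g)
    if norm == 0.0:
        return x                                  # already at a stationary point
    if mode == "normalized":                      # 1] and 2]: eta alone sets the step length
        return x - eta * (g / norm)
    if mode == "capped":                          # variant of 3]: magnitude-driven, but capped
        return x - eta * min(norm, cap) * (g / norm)
    return x - eta * g                            # 3]: plain, unnormalized gradient descent

# Usage with f(x) = x_1^2 + x_2^2, whose gradient is 2x (illustrative only):
x = np.array([3.0, 4.0])
for _ in range(100):
    x = gd_step(x, lambda v: 2.0 * v, eta=0.05, mode="normalized")
print(x)    # the iterates end up within roughly eta of the minimizer at the origin
```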

Now, the step size clearly has an influence on the speed of convergence and on stability. Which of the above step-size choices works best depends purely on your application (i.e., your objective function). In certain cases, the relationship between the speed of convergence, stability and step size can be analyzed, and this relationship may then give a hint as to whether you would want to go with normalized or unnormalized gradient descent.

To summarize, there is no difference between normalized and unnormalized gradient descent as far as the theory behind the algorithm goes. However, the choice has a practical impact on the speed of convergence and stability, and picking one over the other is purely based on the application/objective at hand.
