Solved – Non-convex optimization without using gradient descent

non-convex optimization

If we want to optimize a convex function, we can use methods like gradient descent, or compute the derivative of the function and set it equal to zero to obtain the global minimum.

But I wonder: is it possible to compute the derivative of a non-convex function and obtain all of its local minima that way?

Are there methods other than gradient descent to optimize non-convex functions?

Best Answer

It really depends on the function. For some functions you can compute the derivative, solve for the critical points, and obtain the local minima; e.g. $\sin x$ is non-convex, but it is straightforward to find all its minima ($x = \frac{3\pi}{2} + 2k\pi$, $k\in \mathbb{Z}$).
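As a small illustration of that analytic approach (a sketch assuming SymPy is available; the function is just the $\sin x$ example from above), you can solve $f'(x) = 0$ symbolically and apply the second-derivative test:

```python
import sympy as sp

x = sp.symbols('x')
f = sp.sin(x)

# Critical points: solve f'(x) = cos(x) = 0 (SymPy returns the solutions in one period)
critical_points = sp.solve(sp.diff(f, x), x)  # [pi/2, 3*pi/2]

# Second-derivative test: f''(x) = -sin(x) > 0 indicates a local minimum
for p in critical_points:
    if sp.diff(f, x, 2).subs(x, p) > 0:
        print(f"local minimum of sin(x) at x = {p} (plus any multiple of 2*pi)")
```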

Gradient descent also works on non-convex functions, but it will generally only find a local minimum. There are techniques to "push" gradient descent out of a local minimum, in the hope that it will then find the global minimum or at least a better local minimum.
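One common version of that idea is random restarts: run gradient descent from several random starting points and keep the best result. Below is a minimal hand-rolled sketch, assuming NumPy and a toy non-convex function I made up for illustration:

```python
import numpy as np

def f(x):
    # Toy non-convex objective with several local minima
    return np.sin(3 * x) + 0.1 * x**2

def grad_f(x):
    # Analytic derivative of the toy objective
    return 3 * np.cos(3 * x) + 0.2 * x

def gradient_descent(x0, lr=0.01, steps=500):
    x = x0
    for _ in range(steps):
        x -= lr * grad_f(x)
    return x

rng = np.random.default_rng(0)
starts = rng.uniform(-5, 5, size=20)           # random restarts
candidates = [gradient_descent(x0) for x0 in starts]
best = min(candidates, key=f)                  # keep the best local minimum found
print(f"best x = {best:.4f}, f(x) = {f(best):.4f}")
```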

For example, stochastic gradient descent computes the gradient from only one sample at a time. This not only speeds up each gradient computation (because we only use one sample), it can also help avoid getting stuck in a local minimum: single-sample gradients are noisy, and that noise can be enough to push the parameters towards a better solution.
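Here is a bare-bones sketch of the one-sample update on a toy least-squares problem (the data are synthetic, and this particular loss is actually convex, so it only illustrates the mechanics of the update, not the escape-from-local-minima behaviour):

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))                   # toy features
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=200)     # toy targets

w = np.zeros(3)
lr = 0.01
for epoch in range(20):
    for i in rng.permutation(len(X)):           # visit one sample at a time, in random order
        xi, yi = X[i], y[i]
        grad = 2 * (xi @ w - yi) * xi           # gradient of the single-sample loss (x_i.w - y_i)^2
        w -= lr * grad                          # noisy update based on that one sample

print("estimated weights:", w)
```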

There are also second-order methods, such as Newton's method, which uses the second derivative (the Hessian in higher dimensions) to choose the step direction and size.
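In one dimension the Newton update is simply $x \leftarrow x - f'(x)/f''(x)$. A minimal sketch on the same kind of toy function (note that without safeguards Newton's method can just as well converge to a local maximum or saddle point):

```python
import numpy as np

def fp(x):   # first derivative of f(x) = sin(3x) + 0.1 x^2
    return 3 * np.cos(3 * x) + 0.2 * x

def fpp(x):  # second derivative of the same function
    return -9 * np.sin(3 * x) + 0.2

x = 1.5                        # start reasonably close to a minimum
for _ in range(20):
    x -= fp(x) / fpp(x)        # Newton step: uses curvature to scale the step
print(f"converged to a stationary point near x = {x:.4f}")
```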

Two other widely used methods are BFGS and the nonlinear conjugate gradient method.
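Both are available off the shelf, for instance through SciPy's `scipy.optimize.minimize`; the sketch below runs them on the Rosenbrock function, a standard non-convex test problem (the starting point is arbitrary):

```python
from scipy.optimize import minimize, rosen, rosen_der

x0 = [-1.2, 1.0]  # a common starting point for the Rosenbrock function

res_bfgs = minimize(rosen, x0, jac=rosen_der, method="BFGS")
res_cg   = minimize(rosen, x0, jac=rosen_der, method="CG")

print("BFGS:", res_bfgs.x, res_bfgs.fun)
print("CG:  ", res_cg.x, res_cg.fun)
```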

This also depends on what exactly you are trying to achieve. If you are training a neural network, there are more specific techniques you can use. For example, you can train multiple neural networks from different initial weights; training will converge to different local minima, and you can then take an ensemble of the resulting networks.
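A rough sketch of that idea with scikit-learn's `MLPRegressor` (the dataset, architecture, and hyperparameters are placeholders chosen for illustration): train the same network from several random initializations and average the predictions.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X).ravel() + 0.1 * rng.normal(size=300)   # toy regression task

# Same architecture, different initial weights -> different local minima after training
models = [
    MLPRegressor(hidden_layer_sizes=(32,), max_iter=2000, random_state=seed).fit(X, y)
    for seed in range(5)
]

X_test = np.linspace(-3, 3, 50).reshape(-1, 1)
ensemble_pred = np.mean([m.predict(X_test) for m in models], axis=0)  # average the ensemble
```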

There are a ton of other methods you can use to optimize a non-convex function, and it would be quite difficult (if at all possible) to include all of them in a post. There is also quite a lot of research being done in this area.

To answer your comment on why you would use gradient descent instead of just finding the local minima analytically, two main reasons come to mind:

a) Even if an analytical solution exists, it might be computationally harder than gradient descent. After you derive the formula for the gradient, you still need to set it to zero and solve with respect to the parameters you are trying to optimize, which might involve inverting a matrix. None of that is needed if all you want is to evaluate the gradient so that you can take a gradient-descent step.

b) An analytical solution may not exist, or may be too complicated to work with. Even then, you can approximate the gradient using numerical methods and use that approximation in gradient descent.
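For point b), a common fallback is a central finite-difference approximation of the gradient, which can then be plugged into plain gradient descent. A generic sketch (the objective here is made up for illustration):

```python
import numpy as np

def numerical_gradient(f, x, eps=1e-6):
    """Central-difference approximation of the gradient of f at x."""
    grad = np.zeros_like(x)
    for i in range(len(x)):
        step = np.zeros_like(x)
        step[i] = eps
        grad[i] = (f(x + step) - f(x - step)) / (2 * eps)
    return grad

# Example: descend a function whose analytic gradient we pretend not to know
f = lambda x: np.sin(3 * x[0]) + 0.1 * x[0]**2 + (x[1] - 1)**2

x = np.array([1.0, -1.0])
for _ in range(500):
    x -= 0.01 * numerical_gradient(f, x)   # gradient-descent step with the approximate gradient
print("approximate local minimum at", x)
```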
