Solved – Gradient descent based minimization algorithm that doesn’t require initial guess to be near the global optimum

The problem with gradient descent algorithms, e.g. the Levenberg–Marquardt algorithm, is that they will converge into a minimum that is nearest to the initial guess, so when starting from different positions you will end up in different minima.

Is there a gradient based algorithm that gives similar results no matter where it starts? e.g. one that will escape shallow minimum, and therefore is more likely to converge into a global minimum, or at least a local minimum that is similar to the global. This could be achieved by giving the gradient descent "inertia", so as the alternative moves down the partial derivatives, it gains momentum, and then once all the partial derivatives equal zero (i.e. a minimum), the algorithm continues to move in the direction it was going before, but starts to slow down, as it is now going up the partial derivatives. This would help it escape local minimum and find global minimum.

Does an algorithm like this exist? (That's not a global optimization algorithm like the genetic algorithm or particle swarm optimization)

Best Answer

This is more of a workaround than a direct solution, but a common way to avoid local minima is to run your algorithm several times, from different starting locations. You can then take the best outcome or the average as your final result.

The reason why you might want to take the average rather than the best is to avoid overfitting. Many model types where local minima are a problem have lots of parameters: decision trees, neural networks, etc. Simply taking the best outcome risks obtaining a model that won't generalise well to future data. Taking the average guards against this.

You can get into arbitrarily complex ways of doing the averaging. Have a look at

Best Answer

Related Solutions

Solved – Gradient Based Learning Algorithms vs Global Optimization Learning Algorithms for Neural Networks

Solved – For convex problems, does gradient in Stochastic Gradient Descent (SGD) always point at the global extreme value

Related Question