Solved – Gradient-Based Learning Algorithms vs Global Optimization Algorithms for Neural Networks

genetic algorithms, gradient descent, neural networks, optimization

Neural networks are usually trained using a gradient-based learning algorithm, such as the backpropagation algorithm or some variant of it, but can you use global optimization algorithms, such as the Genetic Algorithm (GA), the Nelder-Mead polytope algorithm, or Particle Swarm Optimisation (PSO), to train the network?

Since training a neural network ultimately boils down to minimizing a multivariable cost function, I would have assumed it should be easy to do this with a global optimization method, but I have tried it myself and I am getting very poor results.

The GA only reduces my cost function from around 250 (with random initial synaptic weights) to around 170. The Nelder-Mead algorithm would apparently take years to converge, and I have not yet tried PSO, as there is no built-in MATLAB function for it.
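
For concreteness, here is a minimal sketch of the kind of setup I mean (a hypothetical 2-4-1 network and toy data, written in Python/SciPy rather than my actual MATLAB code; SciPy's `differential_evolution` stands in for the GA, since SciPy has no built-in GA):

```python
# A minimal sketch (hypothetical 2-4-1 network and toy data, NOT my actual
# project code): flatten all synaptic weights into one vector, express the
# cost as a function of that vector, and hand it to a derivative-free
# optimizer. differential_evolution is a population-based stand-in for the GA.
import numpy as np
from scipy.optimize import minimize, differential_evolution

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 2))            # toy inputs (assumption)
y = np.sin(X[:, 0]) + 0.5 * X[:, 1]          # toy targets (assumption)

n_in, n_hid, n_out = 2, 4, 1
n_params = n_in * n_hid + n_hid + n_hid * n_out + n_out   # 17 weights/biases

def cost(w):
    """Mean squared error of the network as a function of the flat weight vector."""
    i = 0
    W1 = w[i:i + n_in * n_hid].reshape(n_in, n_hid); i += n_in * n_hid
    b1 = w[i:i + n_hid];                               i += n_hid
    W2 = w[i:i + n_hid * n_out].reshape(n_hid, n_out); i += n_hid * n_out
    b2 = w[i:i + n_out]
    h = np.tanh(X @ W1 + b1)                 # hidden layer
    pred = (h @ W2 + b2).ravel()             # linear output layer
    return np.mean((pred - y) ** 2)

w0 = 0.1 * rng.standard_normal(n_params)     # random initial synaptic weights

# Nelder-Mead polytope: derivative-free, local
nm = minimize(cost, w0, method='Nelder-Mead', options={'maxiter': 5000})

# Population-based global search (GA-like)
de = differential_evolution(cost, bounds=[(-5, 5)] * n_params,
                            maxiter=200, seed=0)

print(cost(w0), nm.fun, de.fun)
```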

So is it just accepted that gradient-based algorithms are the most suitable for training a neural network? If so, can someone please point me towards a source for this? That would be very helpful, as I could cite it in my project to explain why I gave up trying to use global optimization methods to train the network.

Thanks!

Best Answer

Reading your question, I get the impression that you think "global optimization methods" always find the global optimum, whatever function you are working on. That idea is quite wrong. Some of these algorithms can indeed find a global optimum, but in practice, for most functions, they don't. And don't forget about the no-free-lunch theorem.

Some classes of functions are easy to optimize (convex functions, for example), but most of the time, and especially for neural network training criteria, you have non-linearity and non-convexity (although for very simple networks this may not be the case). These are pretty nasty functions to optimize, mostly because of their pathological curvature. So why gradients? Because first-order methods have good properties; in particular, they are scalable. Why not higher order? Newton's method cannot be applied, because you have too many parameters and therefore cannot hope to invert the Hessian matrix efficiently.
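
To make "scalable" concrete, here is a minimal sketch (hypothetical 2-4-1 network and toy data, both assumptions) of plain first-order training: backpropagation delivers the exact gradient of the cost for roughly the price of one extra pass through the network, so each update stays cheap no matter how many parameters there are.

```python
# A minimal sketch of first-order training: backpropagation computes the
# exact gradient layer by layer, and plain gradient descent uses it directly.
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 2))                      # toy inputs (assumption)
y = (np.sin(X[:, 0]) + 0.5 * X[:, 1]).reshape(-1, 1)   # toy targets (assumption)

W1 = 0.1 * rng.standard_normal((2, 4)); b1 = np.zeros(4)
W2 = 0.1 * rng.standard_normal((4, 1)); b2 = np.zeros(1)

lr = 0.05
for step in range(2000):
    # forward pass
    h = np.tanh(X @ W1 + b1)
    pred = h @ W2 + b2
    err = pred - y
    cost = np.mean(err ** 2)

    # backward pass: chain rule, layer by layer
    d_pred = 2 * err / len(X)
    dW2 = h.T @ d_pred
    db2 = d_pred.sum(axis=0)
    d_h = (d_pred @ W2.T) * (1 - h ** 2)   # tanh'(z) = 1 - tanh(z)^2
    dW1 = X.T @ d_h
    db1 = d_h.sum(axis=0)

    # gradient descent update
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2

print(cost)
```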

There are also many variants around that are inspired by second-order methods but rely only on first-order computations: Hessian-free optimization, Nesterov's accelerated gradient, momentum methods, etc.
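
As an illustration, here is a minimal sketch of the classical momentum and Nesterov updates on a toy ill-conditioned quadratic (an assumption, standing in for the pathological curvature of a real training criterion); notice that both rely only on first-order gradient evaluations.

```python
# A minimal sketch of classical momentum vs. Nesterov's accelerated gradient
# on an ill-conditioned quadratic 0.5 * w' A w; both use only first-order
# gradients, yet partially compensate for the bad curvature.
import numpy as np

A = np.diag([1.0, 100.0])                   # condition number 100 (assumption)
grad = lambda w: A @ w                      # gradient of 0.5 * w' A w

def momentum(w, lr=0.009, mu=0.9, steps=200):
    v = np.zeros_like(w)
    for _ in range(steps):
        v = mu * v - lr * grad(w)           # accumulate a velocity
        w = w + v
    return w

def nesterov(w, lr=0.009, mu=0.9, steps=200):
    v = np.zeros_like(w)
    for _ in range(steps):
        v = mu * v - lr * grad(w + mu * v)  # gradient at the look-ahead point
        w = w + v
    return w

w0 = np.array([10.0, 1.0])
print(momentum(w0), nesterov(w0))
```

The only difference between the two updates is where the gradient is evaluated: the Nesterov update looks ahead along the velocity before computing it.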

So yes, first-order methods and approximate second-order methods are the best we can do for now, because nothing else works very well.

For further detail, I suggest "Learning Deep Architectures for AI" by Y. Bengio, the book "Neural Networks: Tricks of the Trade", and Ilya Sutskever's PhD thesis.
