Neural Networks – Why Use Gradient Methods Rather Than Other Metaheuristics in Neural Nets?

backpropagation · deep learning · gradient descent · neural networks · optimization

In training deep and shallow neural networks, why are gradient methods (e.g. gradient descent, Nesterov, Newton-Raphson) commonly used, as opposed to other metaheuristics?

By metaheuristics I mean methods such as simulated annealing, ant colony optimization, etc., which were developed to avoid getting stuck in local minima.

Best Answer

Extending @Dikran Marsupial's answer....

Anna Choromanska and her colleagues in Yann LeCun's group at NYU address this in their 2015 AISTATS paper "The Loss Surfaces of Multilayer Networks". Using spin-glass theory, along with some experiments, they argue that:

  • For large-size networks, most local minima are equivalent and yield similar performance on a test set.

  • The probability of finding a "bad" (high value) local minimum is non-zero for small-size networks and decreases quickly with network size.

  • Struggling to find the global minimum on the training set (as opposed to one of the many good local ones) is not useful in practice and may lead to overfitting.

[From page 2 of the paper]

In this view, there's not a great reason to deploy heavy-weight approaches for finding the global minimum. That time would be better spent trying out new network topologies, features, data sets, etc.

That said, lots of people have thought about augmenting or replacing SGD. For fairly small networks (by contemporary standards), these improved metaheuristics do seem to help. Mavrovouniotis and Yang (2016) show that ant colony optimization + backprop beats unmodified backprop on several benchmark data sets (albeit not by much). Rere et al. (2015) use simulated annealing to train a CNN and find that it initially performs better on the validation set. After 10 epochs, however, only a very small (and not-tested-for-significance) difference in performance remains. The faster per-epoch convergence is also offset by dramatically more computation time per epoch, so this is not an obvious win for simulated annealing.
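To make the contrast concrete, here is a minimal sketch (not taken from either paper) that trains a tiny tanh network on XOR once with plain gradient descent and once with simulated annealing. The 2-4-1 architecture, learning rate, cooling schedule, and step counts are all illustrative assumptions; note how much more loss evaluation the annealing loop burns for a comparable result.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: XOR, the classic non-linearly-separable problem.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

def unpack(theta):
    # 2-4-1 network; all weights flattened into one 17-element vector.
    W1 = theta[:8].reshape(2, 4)
    b1 = theta[8:12]
    W2 = theta[12:16].reshape(4, 1)
    b2 = theta[16:17]
    return W1, b1, W2, b2

def loss(theta):
    W1, b1, W2, b2 = unpack(theta)
    h = np.tanh(X @ W1 + b1)      # hidden layer
    out = h @ W2 + b2             # linear output
    return float(np.mean((out - y) ** 2))

def grad(theta, eps=1e-5):
    # Central-difference gradient keeps the sketch short;
    # real training would use backprop instead.
    g = np.zeros_like(theta)
    for i in range(theta.size):
        e = np.zeros_like(theta)
        e[i] = eps
        g[i] = (loss(theta + e) - loss(theta - e)) / (2 * eps)
    return g

def train_gd(theta0, steps=2000, lr=0.1):
    theta = theta0.copy()
    for _ in range(steps):
        theta -= lr * grad(theta)
    return theta

def train_sa(theta0, steps=5000, t0=0.1):
    # Simulated annealing: random perturbations accepted by the
    # Metropolis rule under a linear cooling schedule.
    theta, best = theta0.copy(), theta0.copy()
    cur = best_loss = loss(theta)
    for k in range(steps):
        temp = t0 * (1 - k / steps) + 1e-6
        cand = theta + rng.normal(scale=0.1, size=theta.shape)
        cl = loss(cand)
        if cl < cur or rng.random() < np.exp((cur - cl) / temp):
            theta, cur = cand, cl
            if cl < best_loss:
                best, best_loss = cand.copy(), cl
    return best

theta0 = rng.normal(scale=0.5, size=17)
init_loss = loss(theta0)
gd_loss = loss(train_gd(theta0))
sa_loss = loss(train_sa(theta0))
print(f"initial {init_loss:.4f}  GD {gd_loss:.4f}  SA {sa_loss:.4f}")
```

Each gradient step here costs 34 loss evaluations only because of the finite-difference gradient; with backprop it would cost roughly one forward/backward pass, while the annealer still needs thousands of candidate evaluations and offers no descent guarantee, which is the cost/benefit trade-off discussed above.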

It is possible that these heuristics do a better job of initializing the network, and that once it has been pointed down the right path, any optimizer will do. Sutskever et al. (2013), from Geoff Hinton's group, argue something along these lines in their ICML paper on initialization and momentum.