Machine Learning – Why is Gradient-free Statistical Learning Using Simulated Annealing Not Mainstream in Deep Learning?

backpropagation · machine-learning · neural-networks · optimization

When deep learning is defined, the learning part is almost always tied to backpropagation; mainstream software libraries and much of the literature present gradient-based training as a requirement, with no alternatives. Gradient-free optimisation is rarely mentioned in deep learning or in statistical learning more generally. The situation is similar for "classical algorithms": nonlinear least squares, for example, relies on derivatives [1]. In short, gradient-free learning is not mainstream either in deep learning or in classical methods. One promising alternative is simulated annealing [2, 3], a so-called 'nature-inspired' optimisation method.

Is there any inherent theoretical reason why gradient-free deep learning (statistical learning) is not mainstream, or at least not preferred?
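
For concreteness, this is the kind of gradient-free update I have in mind: a toy simulated-annealing loop over the weights of a tiny network, written in plain NumPy (an illustrative sketch with made-up sizes and cooling schedule, not any library's implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression data and a tiny one-hidden-layer network (3 inputs, 8 hidden units).
X = rng.normal(size=(200, 3))
y = np.sin(X @ np.array([1.0, -2.0, 0.5]))

def unpack(w):
    W1 = w[:24].reshape(3, 8)
    b1 = w[24:32]
    W2 = w[32:40]
    b2 = w[40]
    return W1, b1, W2, b2

def loss(w):
    W1, b1, W2, b2 = unpack(w)
    h = np.tanh(X @ W1 + b1)
    return np.mean((h @ W2 + b2 - y) ** 2)

# Simulated annealing: propose a random perturbation of all weights,
# always accept improvements, accept worse proposals with probability
# exp(-delta / T), and cool the temperature T geometrically.
w = rng.normal(scale=0.1, size=41)
cur = loss(w)
T = 1.0
for step in range(20000):
    proposal = w + rng.normal(scale=0.05, size=w.shape)
    delta = loss(proposal) - cur
    if delta < 0 or rng.random() < np.exp(-delta / T):
        w, cur = proposal, cur + delta
    T *= 0.9995          # geometric cooling schedule

print("final MSE:", cur)
```

No derivatives of the loss appear anywhere; only forward evaluations are used.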

Notes

[1] Such as Levenberg–Marquardt

[2] Simulated Annealing Algorithm for Deep Learning (2015)

[3] CoolMomentum: a method for stochastic optimization by Langevin dynamics with simulated annealing (2021). This is still not fully gradient-free, but it does not require automatic differentiation.

Edit 1
Additional references using the Ensemble Kalman Filter, a derivative-free approach (a minimal sketch of one EKI update follows the list):

  • Ensemble Kalman Inversion: A Derivative-Free Technique For Machine Learning Tasks arXiv:1808.03620.
  • Ensemble Kalman Filter optimizing Deep Neural Networks: An alternative approach to non-performing Gradient Descent (Springer, manuscript PDF)
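
Here is my rough sketch of the basic Ensemble Kalman Inversion iteration as described in these references, for a generic forward map G, in plain NumPy (variable names and the toy linear inverse problem are my own, not taken from either paper):

```python
import numpy as np

rng = np.random.default_rng(1)

def eki_step(U, G, y, Gamma):
    """One Ensemble Kalman Inversion update.

    U     : (J, d) ensemble of parameter vectors
    G     : forward map, G(u) -> model output of length m
    y     : (m,) observed data
    Gamma : (m, m) observation-noise covariance
    """
    J = U.shape[0]
    GU = np.array([G(u) for u in U])          # (J, m) forward evaluations
    U_c = U - U.mean(axis=0)                  # centred parameters
    G_c = GU - GU.mean(axis=0)                # centred outputs
    C_uG = U_c.T @ G_c / J                    # (d, m) cross-covariance
    C_GG = G_c.T @ G_c / J                    # (m, m) output covariance
    K = C_uG @ np.linalg.inv(C_GG + Gamma)    # Kalman-style gain
    # Perturb the data independently for each ensemble member.
    Y = y + rng.multivariate_normal(np.zeros(len(y)), Gamma, size=J)
    return U + (Y - GU) @ K.T

# Toy linear example: recover u_true from y = A u + noise.
# Only forward evaluations of G are used, never its derivatives.
d, m, J = 5, 20, 100
A = rng.normal(size=(m, d))
u_true = rng.normal(size=d)
Gamma = 0.01 * np.eye(m)
y = A @ u_true + rng.multivariate_normal(np.zeros(m), Gamma)

U = rng.normal(size=(J, d))
for _ in range(30):
    U = eki_step(U, lambda u: A @ u, y, Gamma)

print("error:", np.linalg.norm(U.mean(axis=0) - u_true))
```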

Edit 2
As far as I can gather, Yann LeCun does not consider gradient-free learning to be part of the deep learning ecosystem: "DL is constructing networks of parameterized functional modules & training them from examples using gradient-based optimization." tweet

Edit 3
Ben Bolker's comment on local geometry definitely deserves to be one of the answers.

Best Answer

Gradient-free learning is very much in the mainstream; it just is not used heavily in deep learning. Methods for training neural networks that do not involve derivatives are typically called "metaheuristics." In computer science and pattern recognition (which largely grew out of electrical engineering), metaheuristics are the go-to approach for NP-hard problems such as airline flight scheduling, routing delivery trucks to minimize fuel consumption, or the traveling salesman problem (simulated annealing). As examples, see swarm-based learning for neural networks, genetic algorithms for training neural networks, or the use of a metaheuristic to train a convolutional neural network. These are all neural networks whose weights are learned with metaheuristics rather than derivatives.
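
To make this concrete, here is a bare-bones (mu + lambda) evolutionary loop that trains a small network's weights by fitness alone, with no derivatives (a toy sketch; real metaheuristic training uses much richer mutation, crossover, and population management):

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy classification data and a single-hidden-layer network (2 inputs, 6 hidden units).
X = rng.normal(size=(300, 2))
y = (X[:, 0] * X[:, 1] > 0).astype(float)          # XOR-like labels

def accuracy(w):
    W1, b1 = w[:12].reshape(2, 6), w[12:18]
    W2, b2 = w[18:24], w[24]
    h = np.tanh(X @ W1 + b1)
    pred = (h @ W2 + b2) > 0
    return np.mean(pred == y)

# (mu + lambda) evolution: keep the best mu parents, create lambda mutated
# children, and select the next generation purely by fitness (accuracy).
mu, lam, dim = 10, 40, 25
pop = rng.normal(scale=0.5, size=(mu, dim))
for gen in range(200):
    children = np.repeat(pop, lam // mu, axis=0)
    children += rng.normal(scale=0.1, size=children.shape)    # mutation
    everyone = np.vstack([pop, children])
    fitness = np.array([accuracy(w) for w in everyone])
    pop = everyone[np.argsort(fitness)[-mu:]]                  # survivors

print("best accuracy:", accuracy(pop[-1]))
```

Note that the fitness here can even be non-differentiable (raw accuracy), which is exactly where such methods shine.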

While metaheuristics cover a wide swath of the literature, they are simply not strongly associated with deep learning; these are different areas of optimization. Look up "solving NP-hard problems with metaheuristics." Last, recall that the gradients used to train neural networks have nothing to do with the derivatives of some function the network might be used to minimize or maximize. (That would be function approximation with a neural network, as opposed to classification analysis via a neural network.) They are merely derivatives of the error or cross-entropy with respect to the connection weights of the network.

In addition, the derivatives of a function may not be known, or the problem may be too complex for derivatives to be practical. Some newer optimization methods use finite differencing in place of analytic derivatives; as compute gets faster, such derivative-free methods are becoming less prohibitive in practice.
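
For instance, a central-difference scheme estimates the gradient from loss evaluations alone, with no analytic derivatives required (a sketch; in high dimensions one would typically use a stochastic variant such as SPSA rather than perturbing every coordinate):

```python
import numpy as np

def fd_gradient(loss, w, eps=1e-5):
    """Central-difference gradient estimate: costs 2 * len(w) loss evaluations."""
    g = np.zeros_like(w)
    for i in range(len(w)):
        e = np.zeros_like(w)
        e[i] = eps
        g[i] = (loss(w + e) - loss(w - e)) / (2 * eps)
    return g

# Toy quadratic loss; any black-box loss works the same way.
A = np.diag([1.0, 10.0, 100.0])
loss = lambda w: 0.5 * w @ A @ w

w = np.array([1.0, 1.0, 1.0])
for _ in range(500):
    w -= 0.005 * fd_gradient(loss, w)   # plain descent with the estimated gradient

print("w after descent:", w)
```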
