Neural Network Training – Possibility of Training Without Backpropagation

Tags: backpropagation, machine learning, neural networks, optimization

Many neural network books and tutorials spend a lot of time on the backpropagation algorithm, which is essentially a tool to compute the gradient.

Let's assume we are building a model with ~10K parameters/weights. Is it possible to run the optimization using some gradient-free optimization algorithms?

I think computing the numerical gradient would be too slow, but how about other methods such as Nelder-Mead, Simulated Annealing or a Genetic Algorithm?

All of these algorithms would suffer from local minima anyway, so why the obsession with the gradient?
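To make the setting concrete, here is a minimal sketch (not part of the original question: a toy one-hidden-layer regression network with made-up data) of what gradient-free training with scipy's Nelder-Mead looks like. The point is that every optimizer step is paid for purely in forward passes:

```python
import numpy as np
from scipy.optimize import minimize

# Toy regression data (hypothetical stand-in for a real dataset).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=200)

# A tiny one-hidden-layer network; n_hidden controls the parameter count.
n_in, n_hidden = 5, 8

def unpack(theta):
    """Split the flat parameter vector into weights and biases."""
    i = 0
    W1 = theta[i:i + n_in * n_hidden].reshape(n_in, n_hidden); i += n_in * n_hidden
    b1 = theta[i:i + n_hidden]; i += n_hidden
    W2 = theta[i:i + n_hidden]; i += n_hidden
    b2 = theta[i]
    return W1, b1, W2, b2

def loss(theta):
    """Mean squared error of the network, evaluated without any gradients."""
    W1, b1, W2, b2 = unpack(theta)
    h = np.tanh(X @ W1 + b1)
    pred = h @ W2 + b2
    return np.mean((pred - y) ** 2)

n_params = n_in * n_hidden + n_hidden + n_hidden + 1
theta0 = 0.1 * rng.normal(size=n_params)

# Derivative-free optimization: each step costs only forward passes.
res = minimize(loss, theta0, method="Nelder-Mead",
               options={"maxfev": 20000, "xatol": 1e-6, "fatol": 1e-8})
print(f"{n_params} parameters, {res.nfev} function evaluations, loss {res.fun:.4f}")
```

Note that at ~10K parameters the initial Nelder-Mead simplex alone already requires ~10K function evaluations before the first real step is taken.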

Best Answer

The first two algorithms you mention (Nelder-Mead and Simulated Annealing) are generally considered pretty much obsolete in optimization circles, as there are much better alternatives which are both more reliable and less costly. Genetic algorithms cover a wide range, and some of them can be reasonable.

However, in the broader class of derivative-free optimization (DFO) algorithms, there are many which are significantly better than these "classics", as this has been an active area of research in recent decades. So, might some of these newer approaches be reasonable for deep learning?

A relatively recent paper comparing the state of the art is the following:

Rios, L. M., & Sahinidis, N. V. (2013). Derivative-free optimization: A review of algorithms and comparison of software implementations. Journal of Global Optimization.

This is a nice paper which has many interesting insights into recent techniques. For example, the results clearly show that the best local optimizers are all "model-based", using different forms of sequential quadratic programming (SQP).
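To illustrate what "model-based" means here, a deliberately crude sketch (my own toy code, not from the paper): sample the objective near the current point, fit a quadratic surrogate by least squares, and step to the surrogate's minimizer inside a trust region. The solvers surveyed by Rios & Sahinidis are far more careful about sample geometry, model updating, and globalization.

```python
import numpy as np

def quadratic_surrogate_step(f, x0, radius=0.3, n_samples=12, rng=None):
    """One crude model-based DFO step: fit a local quadratic to sampled
    function values and move to its minimizer within a trust region."""
    rng = np.random.default_rng() if rng is None else rng
    d = len(x0)
    # Sample points around x0.
    pts = x0 + radius * rng.uniform(-1, 1, size=(n_samples, d))
    vals = np.array([f(p) for p in pts])

    # Design matrix for f(x) ~ c + g.s + 0.5 s'Hs, with s = x - x0.
    S = pts - x0
    cols = [np.ones(n_samples)] + [S[:, i] for i in range(d)]
    for i in range(d):
        for j in range(i, d):
            scale = 0.5 if i == j else 1.0
            cols.append(scale * S[:, i] * S[:, j])
    A = np.column_stack(cols)
    coef, *_ = np.linalg.lstsq(A, vals, rcond=None)

    # Unpack the surrogate's gradient g and symmetric Hessian H.
    g = coef[1:1 + d]
    H = np.zeros((d, d)); k = 1 + d
    for i in range(d):
        for j in range(i, d):
            H[i, j] = H[j, i] = coef[k]; k += 1

    # Newton step on the surrogate, clipped to the trust region.
    step = -np.linalg.solve(H + 1e-6 * np.eye(d), g)
    norm = np.linalg.norm(step)
    if norm > radius:
        step *= radius / norm
    return x0 + step

# Usage: a few surrogate steps on the Rosenbrock function.
rosen = lambda x: (1 - x[0])**2 + 100 * (x[1] - x[0]**2)**2
x = np.array([-1.2, 1.0])
for _ in range(30):
    x = quadratic_surrogate_step(rosen, x, rng=np.random.default_rng(0))
print("x =", x, "f(x) =", rosen(x))
```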

However, as noted in their abstract, "We find that the ability of all these solvers to obtain good solutions diminishes with increasing problem size." To give an idea of the numbers: for all problems the solvers were given a budget of 2500 function evaluations, and problem sizes ran to a maximum of ~300 parameters. Beyond $\mathrm{O}[10]$ parameters, very few of these optimizers performed well, and even the best showed a noticeable decay in performance as problem size increased. For comparison, a single forward-difference numerical gradient of the ~10K-parameter model in the question would already cost ~10K function evaluations, four times that entire budget.

So for very high-dimensional problems, DFO algorithms simply are not competitive with derivative-based ones. To give some perspective, PDE (partial differential equation)-constrained optimization is another area with very high-dimensional problems (e.g. several parameters for each cell of a large 3D finite-element grid). In this realm, the "adjoint method" is one of the most widely used approaches: a gradient-based optimizer in which the gradient is obtained by (automatic or hand-coded) differentiation of the forward-model code.
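The reason adjoint/reverse-mode gradients scale so well is that a single reverse sweep returns the gradient with respect to all parameters at roughly the cost of one extra forward pass, whereas finite differences need one extra forward pass per parameter. A self-contained hand-written sketch (toy model and data, chosen only to show the bookkeeping):

```python
import numpy as np

# Hand-written forward model plus its adjoint (reverse-mode) pass.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))          # toy inputs
y = rng.normal(size=100)               # toy targets
W1, b1 = rng.normal(size=(5, 16)), np.zeros(16)
W2, b2 = rng.normal(size=16), 0.0

def forward(W1, b1, W2, b2):
    h = np.tanh(X @ W1 + b1)           # hidden layer
    pred = h @ W2 + b2                 # output
    loss = np.mean((pred - y) ** 2)
    return loss, (h, pred)

def adjoint(W1, b1, W2, b2):
    """Forward pass, then one reverse sweep accumulating all gradients."""
    loss, (h, pred) = forward(W1, b1, W2, b2)
    d_pred = 2 * (pred - y) / len(y)           # dL/dpred
    g_W2, g_b2 = h.T @ d_pred, d_pred.sum()
    d_h = np.outer(d_pred, W2) * (1 - h ** 2)  # back through tanh
    g_W1, g_b1 = X.T @ d_h, d_h.sum(axis=0)
    return loss, (g_W1, g_b1, g_W2, g_b2)

loss, grads = adjoint(W1, b1, W2, b2)
n_params = W1.size + b1.size + W2.size + 1
print(f"gradient for all {n_params} parameters from 1 forward + 1 reverse sweep")
```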

The closest thing to a high-dimensional DFO optimizer is perhaps the Ensemble Kalman Filter, used for assimilating data into complex PDE simulations, e.g. weather models. Interestingly, this is essentially an SQP approach, but with a Bayesian-Gaussian interpretation (so the quadratic model is positive definite, i.e. there are no saddle points). However, I do not think that the number of parameters or observations in these applications is comparable to what is seen in deep learning.
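For readers unfamiliar with it, here is a stripped-down sketch of a single ensemble Kalman parameter update for a generic forward model g (real data-assimilation implementations add localization, inflation, and iteration; the linear toy model in the usage is mine and purely illustrative):

```python
import numpy as np

def enkf_update(theta_ens, g, d_obs, obs_cov, rng):
    """One ensemble Kalman update of a parameter ensemble.

    theta_ens : (n_ens, n_params) ensemble of parameter vectors
    g         : forward model mapping parameters -> predicted observations
    d_obs     : (n_obs,) observed data
    obs_cov   : (n_obs, n_obs) observation-error covariance
    """
    n_ens = theta_ens.shape[0]
    pred = np.array([g(t) for t in theta_ens])            # (n_ens, n_obs)

    # Ensemble anomalies (deviations from the ensemble mean).
    A_theta = theta_ens - theta_ens.mean(axis=0)
    A_pred = pred - pred.mean(axis=0)

    # Sample cross-covariance and predicted-data covariance.
    C_td = A_theta.T @ A_pred / (n_ens - 1)               # (n_params, n_obs)
    C_dd = A_pred.T @ A_pred / (n_ens - 1)                # (n_obs, n_obs)

    # Kalman gain and update with perturbed observations.
    K = C_td @ np.linalg.inv(C_dd + obs_cov)
    d_pert = d_obs + rng.multivariate_normal(np.zeros(len(d_obs)), obs_cov, n_ens)
    return theta_ens + (d_pert - pred) @ K.T

# Toy usage: a linear forward model g(theta) = G @ theta.
rng = np.random.default_rng(2)
G = rng.normal(size=(20, 50))                             # 50 params, 20 obs
theta_true = rng.normal(size=50)
obs_cov = 0.01 * np.eye(20)
d_obs = G @ theta_true + rng.multivariate_normal(np.zeros(20), obs_cov)

ens = rng.normal(size=(100, 50))                          # prior ensemble
print("prior misfit:    ", np.linalg.norm(G @ ens.mean(axis=0) - d_obs))
ens = enkf_update(ens, lambda t: G @ t, d_obs, obs_cov, rng)
print("posterior misfit:", np.linalg.norm(G @ ens.mean(axis=0) - d_obs))
```

Note that no derivatives of g appear anywhere; the ensemble anomalies act as a low-rank, Gaussian sensitivity estimate.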

Side note (local minima): From the little I have read on deep learning, I think the consensus is that saddle points, rather than local minima, are what is most problematic in high-dimensional NN parameter spaces.

For example, the recent review in Nature says "Recent theoretical and empirical results strongly suggest that local minima are not a serious issue in general. Instead, the landscape is packed with a combinatorially large number of saddle points where the gradient is zero, and the surface curves up in most dimensions and curves down in the remainder."
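The distinction is easy to see on a toy function such as $f(x,y) = x^2 - y^2$: the gradient vanishes at the origin, but the Hessian has one positive and one negative eigenvalue, so the critical point is a saddle rather than a minimum:

```python
import numpy as np

# f(x, y) = x^2 - y^2 has a critical point at the origin that is a saddle,
# not a minimum: the gradient is zero but the Hessian is indefinite.
def grad(p):
    x, y = p
    return np.array([2 * x, -2 * y])

hessian = np.array([[2.0, 0.0],
                    [0.0, -2.0]])

print("gradient at origin:", grad(np.zeros(2)))            # [0. 0.]
print("Hessian eigenvalues:", np.linalg.eigvalsh(hessian))  # one up, one down
```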

A related concern is local vs. global optimization (for example, the question pointed out in the comments). While I do not do deep learning, in my experience overfitting is definitely a valid concern. In my opinion, global optimization methods are best suited for engineering-design problems that do not strongly depend on "natural" data. In data assimilation problems, the current global minimum could easily change upon the addition of new data (caveat: my experience is concentrated in geoscience problems, where data is generally "sparse" relative to model capacity).

An interesting perspective is perhaps

Bousquet, O., & Bottou, L. (2008). The tradeoffs of large scale learning. NIPS.

which provides semi-theoretical arguments on why and when approximate optimization may be preferable in practice.

End note (meta-optimization): While gradient-based techniques seem likely to remain dominant for training networks, there may be a role for DFO in associated meta-optimization tasks.

One example would be hyper-parameter tuning. (Interestingly, the successful model-based DFO optimizers from Rios & Sahinidis could be seen as essentially solving a sequence of design-of-experiments/response-surface problems.)
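As a purely hypothetical illustration: treat (log learning rate, log L2 penalty) as the variables, let the objective be the validation loss after a short training run, and hand it to a derivative-free optimizer. The train_and_validate function below is a placeholder for a real training loop, with a made-up smooth stand-in so the example runs end to end:

```python
import numpy as np
from scipy.optimize import minimize

def train_and_validate(log_lr, log_l2):
    """Placeholder: train briefly with these hyper-parameters and return
    validation loss.  In practice this would wrap your training loop."""
    return (log_lr + 3.0) ** 2 + 0.5 * (log_l2 + 5.0) ** 2 + 0.1

def objective(z):
    return train_and_validate(log_lr=z[0], log_l2=z[1])

# Only a handful of hyper-parameters, so DFO is cheap enough here.
res = minimize(objective, x0=np.array([-1.0, -2.0]), method="Powell")
print("best (log lr, log l2):", res.x, " val loss:", res.fun)
```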

Another example might be designing architectures, in terms of the set-up of layers (e.g. number, type, sequence, nodes per layer). In this discrete-optimization context, genetic-style algorithms may be more appropriate. Note that here I am thinking of the case where connectivity is determined implicitly by these factors (e.g. fully-connected layers, convolutional layers, etc.); in other words, the $\mathrm{O}[N^2]$ connectivity is *not* meta-optimized explicitly. (The connection strengths would fall under training, where e.g. sparsity could be promoted by $L_1$ regularization and/or ReLU activations ... these choices could be meta-optimized, however.)
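A sketch of the kind of genetic-style loop meant here, over a discrete architecture encoding (layer type and width per layer); evaluate_architecture is a placeholder where one would train and validate each candidate, replaced by a made-up score so the example runs:

```python
import random

random.seed(0)
LAYER_TYPES = ["dense", "conv"]

def random_arch():
    """A candidate architecture: a list of (type, width) layer specs."""
    n_layers = random.randint(1, 5)
    return [(random.choice(LAYER_TYPES), random.choice([32, 64, 128, 256]))
            for _ in range(n_layers)]

def evaluate_architecture(arch):
    """Placeholder fitness: in practice, train the network briefly and
    return validation accuracy.  Here a made-up score keeps it runnable."""
    return -abs(len(arch) - 3) - sum(w for _, w in arch) / 1000.0

def mutate(arch):
    arch = list(arch)
    if random.random() < 0.5 and len(arch) > 1:
        arch.pop(random.randrange(len(arch)))          # drop a layer
    else:
        arch.insert(random.randrange(len(arch) + 1),
                    random_arch()[0])                  # insert a random layer
    return arch

def crossover(a, b):
    cut_a, cut_b = random.randint(0, len(a)), random.randint(0, len(b))
    return (a[:cut_a] + b[cut_b:]) or random_arch()

# Simple generational loop: keep the best half, refill by crossover + mutation.
population = [random_arch() for _ in range(20)]
for gen in range(10):
    population.sort(key=evaluate_architecture, reverse=True)
    parents = population[:10]
    children = [mutate(crossover(random.choice(parents), random.choice(parents)))
                for _ in range(10)]
    population = parents + children

print("best architecture found:", max(population, key=evaluate_architecture))
```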