Stochastic Gradient Descent – Who Invented This Optimization Algorithm?

gradient descenthistoryreferencesstochastic gradient descent

I'm trying to understand the history of Gradient descent and Stochastic gradient descent. Gradient descent was invented in Cauchy in 1847.Méthode générale pour la résolution des systèmes d'équations simultanées. pp. 536–538 For more information about it see here.

Since then gradient descent methods kept developing and I'm not familiar with their history. In particular I'm interested in the invention of stochastic gradient descent.

A reference that can be used in an academic paper in more than welcomed.

Best Answer

Stochastic Gradient Descent is preceded by Stochastic Approximation as first described by Robbins and Monro in their paper, A Stochastic Approximation Method. Kiefer and Wolfowitz subsequently published their paper, *Stochastic Estimation of the Maximum of a Regression Function* which is more recognizable to people familiar with the ML variant of Stochastic Approximation (i.e Stochastic Gradient Descent), as pointed out by Mark Stone in the comments. The 60's saw plenty of research along that vein -- Dvoretzky, Powell, Blum all published results that we take for granted today. It is a relatively minor leap to get from the Robbins and Monro method to the Kiefer Wolfowitz method, and merely a reframing of the problem to then get to Stochastic Gradient Descent (for regression problems). The above papers are widely cited as being the antecedents of Stochastic Gradient Descent, as mentioned in this review paper by Nocedal, Bottou, and Curtis, which provides a brief historical perspective from a Machine Learning point of view.

I believe that Kushner and Yin in their book Stochastic Approximation and Recursive Algorithms and Applications suggest that the notion had been used in control theory as far back as the 40's, but I don't recall if they had a citation for that or if it was anecdotal, nor do I have access to their book to confirm this.

Herbert Robbins and Sutton Monro A Stochastic Approximation Method The Annals of Mathematical Statistics, Vol. 22, No. 3. (Sep., 1951), pp. 400-407, DOI: 10.1214/aoms/1177729586

J. Kiefer and J. Wolfowitz Stochastic Estimation of the Maximum of a Regression Function Ann. Math. Statist. Volume 23, Number 3 (1952), 462-466, DOI: 10.1214/aoms/1177729392

Leon Bottou and Frank E. Curtis and Jorge Nocedal Optimization Methods for Large-Scale Machine Learning, Technical Report, arXiv:1606.04838