The Two Cultures: statistics vs. machine learning

Last year, I read a blog post from Brendan O'Connor entitled "Statistics vs. Machine Learning, fight!" that discussed some of the differences between the two fields. Andrew Gelman responded favorably to this:

Simon Blomberg:

From R's fortunes package: To paraphrase provocatively, 'machine learning is statistics minus any checking of models and assumptions'.
-- Brian D. Ripley (about the difference between machine learning and statistics), useR! 2004, Vienna (May 2004)
🙂 Season's Greetings!

Andrew Gelman:

In that case, maybe we should get rid of checking of models and assumptions more often. Then maybe we'd be able to solve some of the problems that the machine learning people can solve but we can't!

There was also the "Statistical Modeling: The Two Cultures" paper by Leo Breiman in 2001, which argued that statisticians rely too heavily on data modeling and that machine learning techniques are making progress by relying instead on the predictive accuracy of models.

Has the statistics field changed over the last decade in response to these critiques? Do the two cultures still exist or has statistics grown to embrace machine learning techniques such as neural networks and support vector machines?

Best Answer

I think the answer to your first question is simply yes. Take any issue of Statistical Science, JASA, or the Annals of Statistics from the past 10 years and you'll find papers on boosting, SVMs, and neural networks, although this area is less active now. Statisticians have appropriated the work of Valiant and Vapnik, and, on the other side, computer scientists have absorbed the work of Donoho and Talagrand. I don't think there is much difference in scope and methods any more. I have never bought Breiman's argument that CS people were only interested in minimizing loss using whatever works. That view was heavily influenced by his participation in Neural Networks conferences and his consulting work, but PAC learning, SVMs, and boosting all have solid foundations. And today, unlike in 2001, Statistics is more concerned with finite-sample properties, algorithms, and massive datasets.

But I think that there are still three important differences that are not going away soon.

  1. Methodological Statistics papers are still overwhelmingly formal and deductive, whereas Machine Learning researchers are more tolerant of new approaches even if they don't come with a proof attached.
  2. The ML community primarily shares new results and publications through conferences and their proceedings, whereas statisticians use journal papers. This slows down progress in Statistics and the identification of star researchers. John Langford has a nice post on the subject from a while back.
  3. Statistics still covers areas that are (for now) of little concern to ML, such as survey design, sampling, and industrial statistics.