There are IMHO no formal differences that distinguish machine learning and statistics at the fundamental level of fitting models to data. There may be cultural differences in the choice of models, the objectives of fitting models to data, and to some extent the interpretations.
In the typical examples I can think of, we always have
- a collection of models $M_i$ for $i \in I$ for some index set $I$,
- and for each $i$ an unknown component $\theta_i$ (the parameters, which may be infinite dimensional) of the model $M_i$.
Fitting $M_i$ to data is almost always a mathematical optimization problem: finding the choice of the unknown component $\theta_i$ that makes $M_i$ fit the data best, as measured by some favorite loss function.
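To make that concrete, here is a minimal sketch (the data, the straight-line model, and the squared-error loss are all made up for illustration): fitting reduces to handing a loss function over $\theta$ to a generic optimizer.

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical data: y depends roughly linearly on x.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100)
y = 2.0 * x + 1.0 + rng.normal(scale=1.0, size=100)

# One model M_i: y = theta[0] + theta[1] * x, with theta unknown.
def loss(theta):
    # The "favorite function" measuring fit: here, squared error.
    return np.sum((y - (theta[0] + theta[1] * x)) ** 2)

# Fitting the model is an optimization over theta.
fit = minimize(loss, x0=np.zeros(2))
print(fit.x)  # estimated (intercept, slope), close to (1, 2)
```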
The selection among the models $M_i$ is less standard, and a range of techniques is available. If the objective of the model fitting is purely predictive, model selection aims at good predictive performance; if the primary objective is to interpret the resulting model, more easily interpretable models may be preferred over others even when their predictive power is expected to be worse.
What could be called old-school statistical model selection is based on statistical tests, perhaps combined with stepwise selection strategies, whereas machine learning model selection typically focuses on the expected generalization error, which is often estimated by cross-validation. Current developments in, and understanding of, model selection do, however, seem to be converging towards a common ground; see, for instance, Model Selection and Model Averaging.
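A sketch of the cross-validation flavour of model selection, using made-up data and scikit-learn (the candidate models, the degrees tried, and the scoring choice are illustrative assumptions, not a recommendation):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(1)
x = rng.uniform(-3, 3, size=200).reshape(-1, 1)
y = np.sin(x).ravel() + rng.normal(scale=0.3, size=200)

# Candidate models M_i: polynomial regressions of increasing degree.
for degree in [1, 3, 5, 9]:
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    # Expected generalization error estimated by 5-fold cross-validation.
    mse = -cross_val_score(model, x, y, cv=5,
                           scoring="neg_mean_squared_error").mean()
    print(f"degree {degree}: CV mean squared error {mse:.3f}")
```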
Inferring causality from models
The crux of the matter is how we can interpret a model. If the data come from a carefully designed experiment and the model is adequate, it is plausible that we can interpret the effect of changing a variable in the model as a causal effect: if we repeat the experiment and intervene on this particular variable, we can expect to observe the estimated effect. If, however, the data are observational, we cannot expect estimated effects in the model to correspond to observable intervention effects. This requires additional assumptions, irrespective of whether the model is a "machine learning model" or a "classical statistical model".
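A small simulation may make this concrete (the data-generating mechanism below is entirely hypothetical): an unmeasured confounder makes the estimated regression effect differ from the effect we would see under an intervention.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000

# Hypothetical observational setting: an unmeasured confounder z
# drives both the "treatment" x and the outcome y.
z = rng.normal(size=n)
x = z + rng.normal(size=n)                   # x is influenced by z
y = 1.0 * x + 2.0 * z + rng.normal(size=n)   # true causal effect of x is 1

# Naive least-squares slope of y on x:
slope = np.cov(x, y)[0, 1] / np.var(x)
print(slope)  # about 2, not the causal effect 1

# Under an intervention do(x = x0), z no longer tracks x, so the
# observable effect of changing x by one unit would be 1, not 2.
```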
It may be that people trained in classical statistical models, with their focus on univariate parameter estimates and effect-size interpretations, are under the impression that a causal interpretation is more valid in that framework than in a machine learning framework. I would say it is not.
The area of causal inference in statistics does not really remove the problem, but it does make explicit the assumptions upon which causal conclusions rest. They are referred to as untestable assumptions. The paper Causal inference in statistics: An overview by Judea Pearl is a good paper to read. A major contribution from causal inference is a collection of methods for estimating causal effects under assumptions that allow for unobserved confounders, which are otherwise a major concern. See Section 3.3 in the Pearl paper above. A more advanced example can be found in the paper Marginal Structural Models and Causal Inference in Epidemiology.
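Continuing the hypothetical simulation above, here is what adjustment in the spirit of Section 3.3 looks like in the easy case where the confounder happens to be observed; the point of the more advanced methods is precisely the situations where this simple fix is unavailable.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000
z = rng.normal(size=n)                       # the confounder, now observed
x = z + rng.normal(size=n)
y = 1.0 * x + 2.0 * z + rng.normal(size=n)   # true causal effect of x is 1

# Adjusting for z: regress y on both x and z (with an intercept).
X = np.column_stack([np.ones(n), x, z])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta[1])  # about 1, recovering the causal effect of x
```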
It is a subject-matter question whether the untestable assumptions hold. They are untestable precisely because we cannot test them using the data. Justifying the assumptions requires other arguments.
As an example of where machine learning and causal inference meet, the ideas of targeted maximum likelihood estimation, as presented in Targeted Maximum Likelihood Learning by Mark van der Laan and Daniel Rubin, typically exploit machine learning techniques for non-parametric estimation followed by "targeting" towards a parameter of interest. The latter could very well be a parameter with a causal interpretation. The idea of the Super Learner is to rely heavily on machine learning techniques for the estimation of parameters of interest. An important point made by Mark van der Laan (personal communication) is that classical, simple, and "interpretable" statistical models are often wrong, which leads to biased estimators and overly optimistic assessments of the uncertainty of the estimates.
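As a rough sketch of the Super Learner idea, here is generic cross-validated stacking in scikit-learn (this is not van der Laan's implementation, and the data set and base learners are arbitrary choices): instead of trusting one simple model, several learners are combined via their cross-validated predictions.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.linear_model import LinearRegression, RidgeCV
from sklearn.model_selection import cross_val_score

# Hypothetical data for illustration only.
X, y = make_regression(n_samples=500, n_features=10, noise=10.0,
                       random_state=0)

# Super-learner-style ensemble: a simple model and a flexible one,
# combined by a meta-learner trained on cross-validated predictions.
stack = StackingRegressor(
    estimators=[("ols", LinearRegression()),
                ("forest", RandomForestRegressor(random_state=0))],
    final_estimator=RidgeCV(),
)
print(cross_val_score(stack, X, y, cv=5).mean())
```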
I think this is a great question, and not an easy one to answer. I conceptualize machine learning as encompassing a lot of multivariate statistics, because many of the common techniques in multivariate analysis (ordination and clustering, for instance) use unsupervised learning algorithms. For people like me who aren't that concerned about the computer side of things, a lot of this stuff appears to be "under the hood", and I am usually focused more on how ordination works as an extension of regression. But it cannot be ignored that the computer is doing some pretty advanced searching for patterns that I am not responsible for.
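To illustrate the kind of "under the hood" work I mean, here is a toy sketch with standard tools, using PCA as a stand-in for ordination (the data set and the number of clusters are arbitrary choices for illustration):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data

# Ordination: project the multivariate data onto two principal axes.
scores = PCA(n_components=2).fit_transform(X)

# Clustering: an unsupervised search for group structure in the data.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print(scores[:3])
print(labels[:10])
```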
Then there are supervised learning techniques in machine learning outside the realm of regular multivariate analysis. For instance, if you want to predict which category some new object belongs to based on the values of some of its variables, then you can train the algorithm on a bunch of objects whose classification you know and then set the algorithm on classifying the new object. This is clearly not a multivariate statistics technique, and I tend to think of this when I think of machine learning, because it involves that process of communicating the success or failure of a search back to the system. This is where machine learning starts to overlap with AI, and things quickly get completely out of my depth...
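A toy sketch of that supervised workflow (the classifier and data set here are arbitrary choices; any classifier would illustrate the same train-then-predict loop):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Train on objects whose classification we know...
clf = KNeighborsClassifier().fit(X_train, y_train)

# ...then set the algorithm on classifying new objects,
# and report back its success or failure.
print(clf.predict(X_test[:5]))
print(clf.score(X_test, y_test))
```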
In the end, I do agree with the second answer on this thread that machine learning emphasizes prediction, whereas statistics in general is concerned with inference - but again, this is broad strokes stuff and not always going to be true.
Best Answer
I think the answer to your first question is simply in the affirmative. Take any issue of Statistical Science, JASA, or the Annals of Statistics from the past 10 years and you'll find papers on boosting, SVMs, and neural networks, although this area is less active now. Statisticians have appropriated the work of Valiant and Vapnik, while on the other side computer scientists have absorbed the work of Donoho and Talagrand. I don't think there is much difference in scope and methods any more. I have never bought Breiman's argument that CS people were only interested in minimizing loss using whatever works. That view was heavily influenced by his participation in neural network conferences and his consulting work; but PAC learning, SVMs, and boosting all have solid foundations. And today, unlike in 2001, statistics is more concerned with finite-sample properties, algorithms, and massive datasets.
But I think that there are still three important differences that are not going away soon.