- I wonder if statements such as the first one are not exaggerations, in the sense that there is a huge number of 'real' images that could be in the dataset but are not in it. In other words, the ImageNet dataset is not representative of 'real images', even though it obviously samples them.
And this is exactly what your quotes say: those are samples from the distribution.
- Being a real image does not seem to be a probabilistic matter to me. An image either came from the real world or it did not. It does not matter that Mechanical Turk is used to make people guess whether the machine-generated images look real, as in the article. This does not justify the 'probabilistic' interpretation of what the machine learning algorithm is doing. In other words, don't statements such as "with a high probability (over some threshold)" actually point to abuse, or to sweeping something under the rug?
This depends on your definition of probability. It seems that in cases such as those described above, a Bayesian understanding of probability is implicitly adopted, i.e. probability as a measure of how believable something is. In that case you can talk about probabilities of non-repeatable events, for example: what is the probability that the sun will rise tomorrow? It will, or it will not; we have no repeated trials of this event, so under a classical definition of probability this would not be a probabilistic problem at all, but it can be seen as one under the Bayesian understanding of probability.
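The sunrise example can be made concrete with Laplace's classic rule of succession: under a uniform prior on the unknown probability of sunrise, after observing $n$ sunrises in a row the posterior probability of one more is $(n+1)/(n+2)$. A minimal sketch (the count of 10,000 past sunrises is an arbitrary, made-up number):

```python
from fractions import Fraction

def rule_of_succession(successes: int, trials: int) -> Fraction:
    """Posterior predictive probability of another success, assuming a
    uniform Beta(1, 1) prior on the unknown success probability."""
    return Fraction(successes + 1, trials + 2)

# After observing 10,000 sunrises in 10,000 days:
p = rule_of_succession(10_000, 10_000)
print(p)         # 10001/10002
print(float(p))  # close to, but never exactly, 1
```

The point is that the answer is a degree of belief, not a long-run frequency: no repeated trials of "tomorrow" exist, yet the Bayesian machinery still produces a number.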
- I wonder if statements such as the last one are not stretching it. We do not know the 'true data distribution'; we only know a small number of samples from it. Using samples to measure loss is just a (huge) assumption about the relation of the samples to the distribution. That relation is swept under the rug.
And this is basically what statistics is about: we infer properties of the population given only the limited sample we have from it. We learn about properties of the true distribution given only the empirical distribution obtained from our sample.
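As an illustration of that sample-to-population step (a sketch with made-up parameters, where we pretend to know the true distribution only so we can simulate from it):

```python
import random
import statistics

random.seed(0)

# The 'true data distribution' -- normally unknown to the analyst.
TRUE_MEAN, TRUE_SD = 5.0, 2.0

# All we ever observe is a finite sample drawn from it.
sample = [random.gauss(TRUE_MEAN, TRUE_SD) for _ in range(200)]

mean = statistics.fmean(sample)
sem = statistics.stdev(sample) / len(sample) ** 0.5

# An approximate 95% confidence interval for the true mean,
# computed from the sample alone.
lo, hi = mean - 1.96 * sem, mean + 1.96 * sem
print(f"estimate {mean:.2f}, 95% CI ({lo:.2f}, {hi:.2f})")
```

The interval quantifies exactly the thing the question worries about: how loosely the empirical distribution pins down the true one.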
- I wonder if it is not possible to interpret what the machine learning algorithm does in a way that does not rely on such terminology? For example, simply talking about compression? And that the generalization produced is merely an assumption of continuity between the samples (regularity)?
Yes, other models are possible, but it seems that the quotes you provided refer to probabilistic models. Again, statistics is grounded in probabilistic reasoning about data. Without probability it becomes much more problematic to quantify the uncertainty about the true distribution that arises from the fact that we are dealing with limited and imperfect data.
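One standard way to quantify that uncertainty using nothing beyond the sample itself is the bootstrap; a minimal sketch (the data and sample sizes here are made up for illustration):

```python
import random
import statistics

random.seed(1)

# The only thing we actually have: one limited sample.
sample = [random.expovariate(0.5) for _ in range(100)]

# Bootstrap: resample with replacement and watch the statistic vary.
boot_means = []
for _ in range(2000):
    resample = random.choices(sample, k=len(sample))
    boot_means.append(statistics.fmean(resample))

boot_means.sort()
lo, hi = boot_means[50], boot_means[1949]  # ~95% percentile interval
print(f"mean {statistics.fmean(sample):.2f}, "
      f"bootstrap 95% CI ({lo:.2f}, {hi:.2f})")
```

Note that even this "distribution-free" procedure rests on a probabilistic assumption: that the sample was drawn i.i.d. from some fixed distribution.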
The term "machine learning" is somewhat a term of art, but it generally refers to the construction of algorithms that "learn through experience". The requirement of learning through experience necessitates data, and so machine learning is necessarily "data-driven" --- after all, if not from data, what else would it learn from?
When we refer to a "model" in statistics or machine learning, we really just mean a set of assumptions that describe the presumed probabilistic process for the data, and the logical consequences of those assumptions (e.g., resulting distributions of statistics, estimators, etc.). Even very broad forms of non-parametric models are considered "models", so it encompasses a lot. It is difficult to conceive of how you could generate a machine learning algorithm without some assumptions about the generative process for the data, and consequently, one can probably broadly use the term "modelling" for any machine learning process. One might quibble with this, since some machine learning algorithms are broad non-parametric methods, but even here we usually call these "models", and consequently, I think it is reasonable to say that machine learning methods are built on "models". Even such simple methods as least-squares estimation are built on underlying statistical models.
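To make the least-squares remark concrete: ordinary least squares is the maximum-likelihood estimator under a linear model with Gaussian noise. A sketch with made-up coefficients (slope 2, intercept 1):

```python
import random

random.seed(42)

# Assumed statistical model: y = 2*x + 1 + Gaussian noise.
xs = [random.uniform(0, 10) for _ in range(200)]
ys = [2.0 * x + 1.0 + random.gauss(0, 0.5) for x in xs]

# Closed-form least-squares fit -- the maximum-likelihood estimate
# under the Gaussian-noise model above.
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
sxx = sum((x - mx) ** 2 for x in xs)
sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
slope = sxy / sxx
intercept = my - slope * mx
print(f"slope = {slope:.2f}, intercept = {intercept:.2f}")
```

The fitting procedure looks purely algorithmic, but its optimality claims only hold under the probabilistic model it quietly assumes.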
There may certainly be situations in machine learning where an algorithm is built, and even deployed, without regard to any underlying probabilistic assumptions. If the algorithm is sufficiently adaptive (in the sense that most non-parametric models are), one could argue that it is "model-free" insofar as it was created without regard to any model. Even then, and even if the algorithm works well in a wide class of situations, one will still tend to find that there are cases where it works well and cases where it works badly. Consequently, subsequent analysts will usually be able to figure out the kinds of assumptions required to ensure that the algorithm works well when deployed in a given situation. In this case, the "modelling" gradually catches up with the initial "model-free" creation of the algorithm, as we begin to learn more about the situations where the algorithm works well or badly. So you could call some machine-learning algorithms "model-free" in one sense, but the modelling catches up with us in the end.
In view of these considerations, I think it is reasonable to say that all machine learning involves data-driven modelling. Of course, it is possible to do data-driven modelling without using a computer algorithm at all (e.g., calculation by pen and paper), and in these cases we would not usually call that "machine learning".
Best Answer
I think this is more about Bayesian vs. non-Bayesian statistics than about machine learning vs. statistics.
In Bayesian statistics, parameters are modelled as random variables, too. If you have a joint distribution for $X, \alpha$, then $p(X \mid \alpha)$ is a conditional distribution, no matter what the physical interpretation of $X$ and $\alpha$. If one considers only fixed $\alpha$s, or otherwise does not put a probability distribution over $\alpha$, the computations with $p(X; \alpha)$ are exactly the same as those with $p(X \mid \alpha)$, whatever $p(\alpha)$ may be. Furthermore, one can at any point decide to extend the model with fixed values of $\alpha$ to one where there is a prior distribution over $\alpha$. To me at least, it seems strange that the notation for the distribution-given-$\alpha$ should change at this point, wherefore some Bayesians prefer to use the conditioning notation even if one has not (yet?) bothered to define all parameters as random variables.
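A toy illustration of that last point (all numbers made up): the very same function $p(X \mid \alpha)$ serves first as a likelihood evaluated at fixed $\alpha$s, and then, unchanged, as the ingredient of a posterior once a prior $p(\alpha)$ is added.

```python
from math import comb

def likelihood(k: int, n: int, alpha: float) -> float:
    """Binomial p(X = k | alpha): k heads in n flips of a coin with bias alpha."""
    return comb(n, k) * alpha**k * (1 - alpha) ** (n - k)

k, n = 7, 10  # observed data

# Non-Bayesian-style use: evaluate p(X; alpha) at a few fixed alphas.
for a in (0.3, 0.5, 0.7):
    print(f"p(X; alpha={a}) = {likelihood(k, n, a):.4f}")

# Extending the model: put a uniform prior over those same alphas.
alphas = (0.3, 0.5, 0.7)
prior = {a: 1 / 3 for a in alphas}
joint = {a: likelihood(k, n, a) * prior[a] for a in alphas}
evidence = sum(joint.values())
posterior = {a: joint[a] / evidence for a in alphas}
print(posterior)
```

The function `likelihood` is reused verbatim in both roles, which is the sense in which the computations with $p(X; \alpha)$ and $p(X \mid \alpha)$ coincide.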
The argument about whether one can write $p(X ; \alpha)$ as $p(X \mid \alpha)$ has also arisen in the comments of Andrew Gelman's blog post Misunderstanding the $p$-value. For example, Larry Wasserman held that $\mid$ is not allowed when there is no conditioning from a joint distribution, while Andrew Gelman held the opposite opinion.