Solved – Is the machine learning community abusing “true distribution”

machine-learning, terminology

In articles such as "Research #1: Generative Models" one repeatedly reads statements such as

These images are examples of what our visual world looks like and we refer to these as "samples from the true data distribution"

The images referred to are those of the ImageNet dataset.

We also read statements such as

Mathematically, we think about a dataset of examples $x_1,\dots,x_n$ as samples from a true data distribution $p(x)$. In the example image below, the blue region shows the part of the image space that, with a high probability (over some threshold) contains real images, and black dots indicate our data points (each is one image in our dataset).

and

Our goal then is to find parameters $\theta$ that produce a distribution that closely matches the true data distribution (for example, by having a small KL divergence loss).
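To spell out how I read that last statement (assuming the usual maximum-likelihood setup): minimizing the KL divergence between the true data distribution $p$ and the model $p_\theta$ amounts to maximizing an expected log-likelihood under $p$, which in practice is replaced by an average over the available samples,

$$
\arg\min_\theta \, \mathrm{KL}\!\left(p \,\|\, p_\theta\right)
= \arg\max_\theta \, \mathbb{E}_{x \sim p}\!\left[\log p_\theta(x)\right]
\approx \arg\max_\theta \, \frac{1}{n} \sum_{i=1}^{n} \log p_\theta(x_i),
$$

and it is precisely this replacement of $p$ by the samples that my points 2 and 3 below are about.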

Whenever I read such language:

  1. I wonder whether statements such as the first one are not exaggerations, in the sense that there is a huge number of 'real' images that could be in the dataset but are not in it. In other words, the ImageNet dataset is not representative of the space of 'real images', even though it obviously samples it.

  2. Being a real image does not seem to be a probabilistic matter to me. An image either came from the real world or it did not. The fact that Mechanical Turk is used to make people guess whether the machine-generated images look real, as in the article, does not justify the 'probabilistic' interpretation of what the machine learning algorithm is doing. In other words, don't statements such as "with a high probability (over some threshold)" actually point to an abuse, or to sweeping something under the rug?

  3. I wonder whether statements such as the last one are not stretching it. We do not know the 'true data distribution'; we only know a small number of samples from it. Using samples to measure the loss rests on a (huge) assumption about the relation of the samples to the distribution, and that relation is swept under the rug.

  4. I wonder whether it is possible to interpret what the machine learning algorithm does in a way that does not rely on such terminology, for example by simply talking about compression, and by viewing the generalisation it produces as merely an assumption of continuity (regularity) between the samples.

I know my questions are vague, but I would like to know whether other people also have such thoughts, whether they are well founded, and whether they are treated explicitly in some book or paper that I can't seem to find.

Best Answer

  1. I wonder whether statements such as the first one are not exaggerations, in the sense that there is a huge number of 'real' images that could be in the dataset but are not in it. In other words, the ImageNet dataset is not representative of the space of 'real images', even though it obviously samples it.

And this is exactly what your quotes say: those images are samples from the distribution, not the distribution itself. Calling them samples already concedes that a huge number of real images are not in the dataset.

  2. Being a real image does not seem to be a probabilistic matter to me. An image either came from the real world or it did not. The fact that Mechanical Turk is used to make people guess whether the machine-generated images look real, as in the article, does not justify the 'probabilistic' interpretation of what the machine learning algorithm is doing. In other words, don't statements such as "with a high probability (over some threshold)" actually point to an abuse, or to sweeping something under the rug?

This depends on your definition of probability. It seems that in such cases they implicitly adopt a Bayesian understanding of probability, i.e. probability as a measure of how believable something is. In that case you can talk about probabilities of non-repeatable events, for example "what is the probability that the sun will rise tomorrow?". It will, or it will not; the event is a single future occurrence rather than the outcome of a long run of repeatable trials, so under a classical (frequentist) definition of probability this would not be a probabilistic problem, but it can be seen as one under the Bayesian understanding of probability.

  3. I wonder whether statements such as the last one are not stretching it. We do not know the 'true data distribution'; we only know a small number of samples from it. Using samples to measure the loss rests on a (huge) assumption about the relation of the samples to the distribution, and that relation is swept under the rug.

And this is basically what statistics is about: we infer the properties of a population given only the limited sample from it that we have. We learn about the properties of the true distribution given only the empirical distribution obtained from our sample.
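As a toy sketch of this point (assuming, purely for illustration, that the "true" distribution is a one-dimensional Gaussian that we then pretend not to know): fitting a parametric model by maximum likelihood uses nothing but the sample, yet the fitted distribution gets closer to the true one as the sample grows.

```python
# Toy illustration: the "true" distribution is a 1-D Gaussian we pretend not to know;
# all the fitting procedure ever sees are samples from it (the empirical distribution).
import numpy as np

rng = np.random.default_rng(0)
true_mu, true_sigma = 2.0, 1.5  # parameters of the unknown "true" p(x)

def kl_gauss(mu_p, sig_p, mu_q, sig_q):
    """Closed-form KL(p || q) for two univariate Gaussians."""
    return np.log(sig_q / sig_p) + (sig_p**2 + (mu_p - mu_q)**2) / (2 * sig_q**2) - 0.5

for n in (10, 100, 1_000, 100_000):
    x = rng.normal(true_mu, true_sigma, size=n)  # the dataset: n samples from p
    mu_hat, sigma_hat = x.mean(), x.std()        # maximum-likelihood estimates from the sample alone
    print(f"n={n:>6}  KL(true || fitted) = {kl_gauss(true_mu, true_sigma, mu_hat, sigma_hat):.5f}")
```

The Gaussian is not the point, of course; the point is that "matching the true distribution" is only ever assessed through the samples, and that this is the ordinary business of statistical inference.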

  4. I wonder whether it is possible to interpret what the machine learning algorithm does in a way that does not rely on such terminology, for example by simply talking about compression, and by viewing the generalisation it produces as merely an assumption of continuity (regularity) between the samples.

Yes, other interpretations are possible, but the quotes you provided refer to probabilistic models. Again, statistics is grounded in probabilistic reasoning about data. Without probability it becomes much more problematic to quantify the uncertainty about the true distribution that arises from the fact that we are dealing with limited and imperfect data when inferring about it.
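For instance, one standard probabilistic tool for quantifying that uncertainty from the limited sample alone is the bootstrap; a minimal sketch (the exponential "true" distribution here is just a stand-in chosen for the example):

```python
# Minimal bootstrap sketch: quantify uncertainty about a property of the true
# distribution (here, its mean) using only the one sample we actually have.
import numpy as np

rng = np.random.default_rng(1)
sample = rng.exponential(scale=3.0, size=50)  # pretend this is all the data we have

boot_means = np.array([
    rng.choice(sample, size=sample.size, replace=True).mean()  # resample with replacement
    for _ in range(10_000)
])
lo, hi = np.percentile(boot_means, [2.5, 97.5])
print(f"sample mean = {sample.mean():.2f}, 95% bootstrap interval = ({lo:.2f}, {hi:.2f})")
```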
