Machine Learning – Realistically, Does the IID Assumption Hold for Most Supervised Learning Tasks?

Tags: assumptions, dataset, iid, linear model, machine learning

The i.i.d. assumption states:

We are given a data set $\{(x_i,y_i)\}_{i = 1, \ldots, n}$, where each data point $(x_i,y_i)$ is generated in an independent and identically distributed fashion.

To me, this physically means that the generation of $(x_i,y_i)$ has no effect on $(x_j,y_j)$ for $j \neq i$, and vice versa.
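For concreteness, here is a minimal sketch in Python (a toy linear-Gaussian setup of my own, not from the question) contrasting a data-generating process in which each pair is drawn independently from the same distribution with one in which each $x_i$ depends on $x_{i-1}$:

```python
import numpy as np

rng = np.random.default_rng(0)

def generate_iid(n):
    """Each (x_i, y_i) is drawn independently from the same joint distribution."""
    x = rng.normal(size=n)            # x_i ~ N(0, 1), independent of all other draws
    y = 2.0 * x + rng.normal(size=n)  # y_i depends only on its own x_i plus fresh noise
    return x, y

def generate_dependent(n):
    """Each x_i depends on x_{i-1}, so the pairs are no longer independent."""
    x = np.zeros(n)
    for i in range(1, n):
        x[i] = 0.9 * x[i - 1] + rng.normal()  # autoregressive dependence across examples
    y = 2.0 * x + rng.normal(size=n)
    return x, y
```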

But does this hold true in practice?

For example, one of the most basic machine learning tasks is prediction on the MNIST dataset. Is there a way to know whether MNIST was generated in an i.i.d. fashion? The same goes for thousands of other data sets. How does any practitioner know how a data set was generated?

I also sometimes see people mention shuffling the data to make it more independent or random. Does shuffling create a tangible benefit compared to a non-shuffled data set?
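In case it helps to pin down what "shuffling" means operationally, here is a minimal sketch (assuming NumPy arrays `X` and `y`; the names are illustrative): apply one random permutation jointly to features and labels, so each pair stays intact but the ordering is destroyed.

```python
import numpy as np

rng = np.random.default_rng(42)

def shuffle_dataset(X, y):
    """Apply the same random permutation to X and y so (x_i, y_i) pairs stay matched."""
    perm = rng.permutation(len(y))
    return X[perm], y[perm]
```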

For example, suppose we create a "sequential" MNIST data set containing digits arranged in an increasing sequence 1, 2, 3, 4, 5, 6, ... Obviously, this data set was not generated in an independent fashion: if you generate a 1, the next digit must be a 2. But does training a classifier on this data set differ in any way from training on a shuffled one?
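One way to probe the practical side of this question is a small experiment. The sketch below (my own illustration, not from the question or answer) uses scikit-learn's bundled 8x8 digits data as a stand-in for MNIST and feeds an SGD classifier mini-batches either in label-sorted order or in shuffled order; with this kind of incremental training the sorted ordering typically hurts, because each mini-batch contains essentially one class at a time.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

def incremental_accuracy(order):
    """Train an SGD classifier with partial_fit on mini-batches taken in the given order."""
    clf = SGDClassifier(random_state=0)
    classes = np.unique(y_train)
    for batch in np.array_split(order, 50):                     # 50 sequential mini-batches
        clf.partial_fit(X_train[batch], y_train[batch], classes=classes)
    return clf.score(X_test, y_test)

sorted_order = np.argsort(y_train)               # "sequential" ordering: all 0s, then 1s, ...
shuffled_order = rng.permutation(len(y_train))   # i.i.d.-like presentation order

print("label-sorted order accuracy:", incremental_accuracy(sorted_order))
print("shuffled order accuracy:    ", incremental_accuracy(shuffled_order))
```

Note that the ordering only changes how the optimizer sees the data; a batch method that fits on the entire training set at once would give identical results for both orderings, since the same examples are present either way.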

Just some basic questions.

Best Answer

The operational meaning of the IID condition is given by the celebrated "representation theorem" of Bruno de Finetti (which, in my humble opinion, is one of the greatest innovations of probability theory ever discovered). According to this brilliant theorem, if we have a sequence $\mathbf{X}=(X_1,X_2,X_3,...)$ with limiting empirical distribution $F_\mathbf{x}$, and the values in the sequence are exchangeable, then we have:

$$X_1,X_2,X_3, ... | F_\mathbf{x} \sim \text{IID } F_\mathbf{x}.$$

This means that the condition of exchangeability of an infinite sequence of values is the operational condition required for the values to be independent and identically distributed (conditional on some underlying distribution function). The theorem can be applied in both Bayesian and classical statistics (see O'Neill 2009 for further discussion), and in the latter case, the empirical distribution is treated as an "unknown constant" and so we usually drop the conditioning notation. Among other things, this theorem clarifies the requirement for "repeated trials" in the frequentist definition of probability.
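As a concrete illustration of what "conditionally IID given $F_\mathbf{x}$" means, consider the classic Pólya urn (my own example, not part of the original answer): the sequence of draws is exchangeable but not independent, since marginally each draw raises the probability of seeing the same colour next, yet conditional on the long-run fraction of colours the draws behave like IID Bernoulli trials. A small simulation sketch:

```python
import numpy as np

rng = np.random.default_rng(1)

def polya_run(n_draws):
    """Pólya urn: start with 1 red and 1 blue ball; each drawn ball is replaced
    together with one extra ball of the same colour. Returns 0/1 draw indicators (1 = red)."""
    red, blue, draws = 1, 1, []
    for _ in range(n_draws):
        draw = rng.random() < red / (red + blue)
        draws.append(int(draw))
        if draw:
            red += 1
        else:
            blue += 1
    return np.array(draws)

# Marginally the draws are dependent: across many short runs,
# P(2nd draw red | 1st draw red) is about 2/3, not 1/2.
pairs = np.array([polya_run(2) for _ in range(20000)])
cond = pairs[pairs[:, 0] == 1, 1].mean()
print("P(X2 = red | X1 = red) approx.", round(cond, 3))

# But within one long run the draws look like IID Bernoulli(p) with a run-specific p:
run = polya_run(20000)
print("long-run fraction of red approx.", round(run.mean(), 3))
print("lag-1 autocorrelation within the run approx.",
      round(np.corrcoef(run[:-1], run[1:])[0, 1], 3))  # close to 0
```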

As with many other probabilistic results, the "representation theorem" actually refers to a class of theorems that apply in various cases. You can find a good summary of the various representation theorems in Kingman (1978) and Ressel (1985). The original version, due to de Finetti, established this correspondence only for binary sequences of values. This was later extended to the more general version that is most commonly used (and corresponds to the version shown above) by Hewitt and Savage (1955). This latter representation theorem is sometimes called the de Finetti-Hewitt-Savage theorem, since it is their extension that gives the full power of the theorem. There is another useful extension by Diaconis and Freedman (1980) that establishes a representation theorem for cases of finite exchangeability; roughly speaking, in this case the values are "almost IID", in the sense that there is a bounded difference between their actual probabilities and those of an IID approximation.

As the other answers on this thread point out, the IID condition has various advantages in terms of mathematical convenience and simplicity. While I do not see that as a justification of realism, it is certainly an ancillary benefit of this model structure, and it speaks to the importance of the representation theorems. These theorems give an operational grounding for the IID model, and show that it is sufficient to assume exchangeability of an infinite sequence to obtain this model. Thus, in practice, if you want to know if a sequence of values is IID, all you need to do is ask yourself, "If I took any finite set of values from this sequence, would their probability measure change if I were to change the order of those values?" If the answer is no, then you have an exchangeable sequence, and hence, the IID condition is met.
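To turn that thought experiment into something you can probe numerically, here is one possible (and deliberately simple) check, using a toy example of my own: under exchangeability every pair $(X_i, X_j)$ has the same joint distribution, so the correlation between neighbouring values and the correlation between values two steps apart must be equal. For a Gaussian AR(1) series they are not, which flags the series as non-exchangeable and hence not (conditionally) IID.

```python
import numpy as np

rng = np.random.default_rng(2)

def ar1(n, rho=0.8):
    """Stationary Gaussian AR(1): x_t = rho * x_{t-1} + noise."""
    x = np.zeros(n)
    x[0] = rng.normal(scale=1.0 / np.sqrt(1.0 - rho**2))  # draw from the stationary distribution
    for t in range(1, n):
        x[t] = rho * x[t - 1] + rng.normal()
    return x

x = ar1(100_000)

# Under exchangeability, corr(X_t, X_{t+1}) and corr(X_t, X_{t+2}) would be equal.
lag1 = np.corrcoef(x[:-1], x[1:])[0, 1]
lag2 = np.corrcoef(x[:-2], x[2:])[0, 1]
print("lag-1 correlation:", round(lag1, 3))   # roughly 0.8
print("lag-2 correlation:", round(lag2, 3))   # roughly 0.64, so the sequence is not exchangeable
```

A check like this can only falsify exchangeability, not confirm it: matching correlations at a few lags is necessary but not sufficient for the full permutation-invariance described above.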
