The i.i.d. assumption about the pairs $(\mathbf{X}_i, y_i)$, $i = 1, \ldots, N$, is often made in statistics and in machine learning: sometimes for a good reason, sometimes out of convenience, and sometimes just because we usually make this assumption. To answer satisfactorily whether the assumption is really necessary, and what the consequences are of not making it, I would easily end up writing a book (if you ever easily end up doing something like that). Here I will try to give a brief overview of what I find to be the most important aspects.
A fundamental assumption
Let's assume that we want to learn a probability model of $y$ given $\mathbf{X}$, which we call $p(y \mid \mathbf{X})$. We do not make any assumptions about this model a priori, but we will make the minimal assumption that such a model exists such that
- the conditional distribution of $y_i$ given $\mathbf{X}_i$ is $p(y_i \mid \mathbf{X}_i)$.
What is worth noting about this assumption is that the conditional distribution of $y_i$ depends on $i$ only through $\mathbf{X}_i$. This is what makes the model useful, e.g. for prediction. The assumption holds as a consequence of the identically distributed part under the i.i.d. assumption, but it is weaker because we don't make any assumptions about the $\mathbf{X}_i$'s.
In the following the focus will mostly be on the role of independence.
Modelling
There are two major approaches to learning a model of $y$ given $\mathbf{X}$. One approach is known as discriminative modelling and the other as generative modelling.
- Discriminative modelling: We model $p(y \mid \mathbf{X})$ directly, e.g. a logistic regression model, a neural network, a tree or a random forest. The working modelling assumption will typically be that the $y_i$'s are conditionally independent given the $\mathbf{X}_i$'s, though estimation techniques relying on subsampling or bootstrapping make most sense under the i.i.d. or the weaker exchangeability assumption (see below). But generally, for discriminative modelling we don't need to make distributional assumptions about the $\mathbf{X}_i$'s.
- Generative modelling: We model the joint distribution, $p(\mathbf{X}, y)$, of $(\mathbf{X}, y)$ typically by modelling the conditional distribution $p(\mathbf{X} \mid y)$ and the marginal distribution $p(y)$. Then we use Bayes's formula for computing $p(y \mid \mathbf{X})$. Linear discriminant analysis and naive Bayes methods are examples. The working modelling assumption will typically be the i.i.d. assumption.
For both modelling approaches the working modelling assumption is used to derive or propose learning methods (or estimators). That could be by maximising the (penalised) log-likelihood, minimising the empirical risk or by using Bayesian methods. Even if the working modelling assumption is wrong, the resulting method can still provide a sensible fit of $p(y \mid \mathbf{X})$.
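To make the distinction concrete, here is a minimal sketch (my own illustration, not from the text) that fits a discriminative and a generative model of $p(y \mid \mathbf{X})$ to the same synthetic data, assuming scikit-learn is available; the data-generating mechanism and all names are hypothetical.

```python
# Sketch: one discriminative and one generative estimate of p(y | X) on made-up data.
import numpy as np
from sklearn.linear_model import LogisticRegression   # discriminative
from sklearn.naive_bayes import GaussianNB            # generative

rng = np.random.default_rng(0)
N = 500
X = rng.normal(size=(N, 2))
# A "true" conditional model p(y = 1 | X) used only to simulate labels.
p = 1 / (1 + np.exp(-(1.5 * X[:, 0] - 1.0 * X[:, 1])))
y = rng.binomial(1, p)

# Discriminative: model p(y | X) directly via the conditional likelihood.
disc = LogisticRegression().fit(X, y)

# Generative: model p(X | y) and p(y), then use Bayes's formula for p(y | X).
gen = GaussianNB().fit(X, y)

# Both return estimates of the same conditional distribution p(y | X).
print(disc.predict_proba(X[:3]))
print(gen.predict_proba(X[:3]))
```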
Some techniques used together with discriminative modelling, such as bagging (bootstrap aggregation), work by fitting many models to data sampled randomly from the dataset. Without the i.i.d. assumption (or exchangeability) the resampled datasets will not have a joint distribution similar to that of the original dataset. Any dependence structure has become "messed up" by the resampling. I have not thought deeply about this, but I don't see why that should necessarily break the method as a method for learning $p(y \mid \mathbf{X})$. At least not for methods based on the working independence assumptions. I am happy to be proved wrong here.
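For illustration, here is a minimal bagging sketch (my own, not from the text; the base learner is an arbitrary choice). Resampling rows with replacement implicitly treats the pairs $(\mathbf{X}_i, y_i)$ as exchangeable, which is exactly where any dependence structure gets lost.

```python
# Sketch of bootstrap aggregation under the working independence assumptions.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagged_fit(X, y, n_models=50, seed=0):
    rng = np.random.default_rng(seed)
    models = []
    for _ in range(n_models):
        idx = rng.integers(0, len(y), size=len(y))   # resample pairs with replacement;
        models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))  # dependence is lost here
    return models

def bagged_predict_proba(models, X):
    # Aggregate by averaging the individual estimates of p(y | X).
    # (Assumes each bootstrap sample contains all classes, so shapes agree.)
    return np.mean([m.predict_proba(X) for m in models], axis=0)
```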
Consistency and error bounds
A central question for all learning methods is whether they result in models close to $p(y \mid \mathbf{X})$. There is a vast theoretical literature in statistics and machine learning dealing with consistency and error bounds. A main goal of this literature is to prove that the learned model is close to $p(y \mid \mathbf{X})$ when $N$ is large. Consistency is a qualitative assurance, while error bounds provide (semi-) explicit quantitative control of the closeness and give rates of convergence.
The theoretical results all rely on assumptions about the joint distribution of the observations in the dataset. Often the working modelling assumptions mentioned above are made (that is, conditional independence for discriminative modelling and i.i.d. for generative modelling). For discriminative modelling, consistency and error bounds will require that the $\mathbf{X}_i$'s fulfil certain conditions. In classical regression one such condition is that $\frac{1}{N} \mathbb{X}^T \mathbb{X} \to \Sigma$ for $N \to \infty$, where $\mathbb{X}$ denotes the design matrix with rows $\mathbf{X}_i^T$. Weaker conditions may be enough for consistency. In sparse learning another such condition is the restricted eigenvalue condition, see e.g. On the conditions used to prove oracle results for the Lasso. The i.i.d. assumption together with some technical distributional assumptions imply that some such sufficient conditions are fulfilled with large probability, and thus the i.i.d. assumption may prove to be a sufficient but not a necessary assumption to get consistency and error bounds for discriminative modelling.
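As a small numerical illustration of the classical condition (my own, under an assumed mean-zero i.i.d. Gaussian design, so that $\Sigma$ is both the covariance and the second-moment matrix), the scaled Gram matrix $\frac{1}{N} \mathbb{X}^T \mathbb{X}$ stabilises around $\Sigma$ as $N$ grows:

```python
# The scaled Gram matrix (1/N) X^T X approaches Sigma as N increases.
import numpy as np

rng = np.random.default_rng(1)
Sigma = np.array([[1.0, 0.5],
                  [0.5, 2.0]])   # population second-moment matrix
for N in (100, 10_000, 1_000_000):
    X = rng.multivariate_normal(np.zeros(2), Sigma, size=N)   # design matrix, rows X_i^T
    print(N, np.round(X.T @ X / N, 3))
```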
The working modelling assumption of independence may be wrong for either of the modelling approaches. As a rough rule-of-thumb one can still expect consistency if the data comes from an ergodic process, and one can still expect some error bounds if the process is sufficiently fast mixing. A precise mathematical definition of these concepts would take us too far away from the main question. It is enough to note that there exist dependence structures besides the i.i.d. assumption for which the learning methods can be proved to work as $N$ tends to infinity.
If we have more detailed knowledge about the dependence structure, we may choose to replace the working independence assumption used for modelling with a model that captures the dependence structure as well. This is often done for time series. A better working model may result in a more efficient method.
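As a hedged illustration of this point, here is a sketch (mine, with made-up data) of a time-series working model in which the lagged response is included as a regressor, so that the serial dependence is modelled rather than ignored:

```python
# Sketch: y_t depends on X_t and on y_{t-1} (an AR(1)-type dependence structure).
# The working model regresses y_t on (y_{t-1}, X_t) instead of assuming independence.
import numpy as np

rng = np.random.default_rng(2)
T = 1000
x = rng.normal(size=T)
y = np.zeros(T)
for t in range(1, T):
    y[t] = 0.7 * y[t - 1] + 1.0 * x[t] + rng.normal(scale=0.5)

# Least-squares fit of the autoregressive working model.
Z = np.column_stack([y[:-1], x[1:]])              # regressors: y_{t-1} and X_t
coef, *_ = np.linalg.lstsq(Z, y[1:], rcond=None)
print(coef)   # roughly (0.7, 1.0)
```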
Model assessment
Rather than proving that the learning method gives a model close to $p(y \mid \mathbf{X})$, it is of great practical value to obtain a (relative) assessment of "how good a learned model is". Such assessment scores are comparable for two or more learned models, but they will not provide an absolute assessment of how close a learned model is to $p(y \mid \mathbf{X})$. Estimates of assessment scores are typically computed empirically based on splitting the dataset into a training and a test dataset or by using cross-validation.
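For concreteness, here is a minimal sketch (my own; the model and the score are arbitrary choices, not prescribed by the text) of estimating an assessment score by $K$-fold cross-validation with random fold assignment:

```python
# Sketch: cross-validated log-loss as an empirical (relative) assessment score.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold
from sklearn.metrics import log_loss

def cv_log_loss(X, y, n_splits=5, seed=0):
    scores = []
    for train_idx, test_idx in KFold(n_splits, shuffle=True, random_state=seed).split(X):
        model = LogisticRegression().fit(X[train_idx], y[train_idx])
        scores.append(log_loss(y[test_idx], model.predict_proba(X[test_idx])))
    # The average is comparable across learned models, but it is not an
    # absolute measure of closeness to p(y | X).
    return float(np.mean(scores))
```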
As with bagging, a random splitting of the dataset will "mess up" any dependence structure. However, for methods based on the working independence assumptions, ergodicity assumptions weaker than i.i.d. should be sufficient for the assessment estimates to be reasonable, though standard errors on these estimates will be very difficult to come up with.
[Edit: Dependence among the variables will result in a distribution of the learned model that differs from the distribution under the i.i.d. assumption. The estimate produced by cross-validation is not obviously related to the generalization error. If the dependence is strong, it will most likely be a poor estimate.]
Summary (tl;dr)
All the above is under the assumption that there is a fixed conditional probability model, $p(y \mid \mathbf{X})$. Thus there cannot be trends or sudden changes in the conditional distribution not captured by $\mathbf{X}$.
When learning a model of $y$ given $\mathbf{X}$, independence plays a role as
- a useful working modelling assumption that allows us to derive learning methods
- a sufficient but not necessary assumption for proving consistency and providing error bounds
- a sufficient but not necessary assumption for using random data splitting techniques such as bagging for learning and cross-validation for assessment.
Understanding precisely which alternatives to i.i.d. are also sufficient is non-trivial and to some extent a research subject.
- Probability distribution over datasets: What are the datasets? How is the probability distribution generated?
Once we can estimate the underlying distribution of the input data, we essentially know how the examples are generated and can make good predictions (a generative model). Normally, we assume an underlying distribution according to what we believe (an inductive bias). For example, if we believe there is a high probability that values are close to zero, we can take a Gaussian distribution with mean $0$ and tune parameters such as the variance when we train. A dataset might, for example, be a set of coin tosses, and the assumed distribution would then be binomial. When we maximize the log-likelihood for the actual data points, we obtain the parameters that make the assumed distribution best fit the dataset; a minimal worked example is sketched below.
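A tiny worked example (mine, not from the book): maximizing the Bernoulli log-likelihood over a grid of parameter values recovers the empirical frequency of heads.

```python
# Grid-search maximum likelihood for an assumed Bernoulli (coin-toss) model.
import numpy as np

tosses = np.array([1, 0, 1, 1, 0, 1, 1, 0])   # 1 = heads, 0 = tails

def log_likelihood(p, data):
    return np.sum(data * np.log(p) + (1 - data) * np.log(1 - p))

grid = np.linspace(0.01, 0.99, 99)
p_hat = grid[np.argmax([log_likelihood(p, tosses) for p in grid])]
print(p_hat, tosses.mean())   # the maximizer is (close to) the empirical frequency 5/8
```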
- The examples are independent of each other. Can you give me an example of where the examples are dependent?
For example, suppose we toss a coin and, if we get heads, we toss again; otherwise we stop. Here there is dependence between subsequent tosses; a small simulation is sketched below.
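A small simulation sketch of this dependent tossing scheme (my own illustration):

```python
# Whether toss t happens at all depends on the outcome of toss t-1,
# so the recorded values are not independent.
import numpy as np

def dependent_tosses(max_tosses=10, seed=0):
    rng = np.random.default_rng(seed)
    results = []
    for _ in range(max_tosses):
        toss = rng.integers(0, 2)   # 1 = heads, 0 = tails
        results.append(int(toss))
        if toss == 0:               # tails: stop tossing, so the sequence
            break                   # length depends on earlier outcomes
    return results

print(dependent_tosses())
```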
- Drawn from the same probability distribution as each other: Suppose the probability distribution is Gaussian. Does the term "same probability distribution" mean that all the examples are drawn from a Gaussian distribution with the same mean and variance? And what does "this assumption enables us" mean?
Yes. That is why (4) is stated: once you have a probability distribution for one example, you do not need other examples to describe the data-generating process.
- Finally, for the last paragraph of page 122, it is given that the samples follow a Bernoulli distribution. What does this mean intuitively?
It means that each example can be thought of as a coin toss. If the experiment were multiple coin tosses, each toss would be independent, with the probability of heads equal to $\frac{1}{2}$. Similarly, for any other experiment you choose, the result of each example can be thought of as a coin toss or as a roll of an $n$-sided die.
Generating examples means finding the distribution closest to what we see in the training dataset. That is obtained by assuming a distribution, maximizing the likelihood of the given dataset and outputting the optimal parameters.
Best Answer
The operational meaning of the IID condition is given by the celebrated "representation theorem" of Bruno de Finetti (which, in my humble opinion, is one of the greatest innovations of probability theory ever discovered). According to this brilliant theorem, if we have a sequence $\mathbf{X}=(X_1,X_2,X_3,...)$ with empirical distribution $F_{\mathbf{X}}$, and if the values in the sequence are exchangeable, then we have:
$$X_1, X_2, X_3, \ldots \mid F_{\mathbf{X}} \sim \text{IID } F_{\mathbf{X}}.$$
This means that the condition of exchangeability of an infinite sequence of values is the operational condition required for the values to be independent and identically distributed (conditional on some underlying distribution function). The theorem can be applied in both Bayesian and classical statistics (see O'Neill 2009 for further discussion), and in the latter case, the empirical distribution is treated as an "unknown constant" and so we usually drop the conditioning notation. Among other things, this theorem clarifies the requirement for "repeated trials" in the frequentist definition of probability.
As with many other probabilistic results, the "representation theorem" actually refers to a class of theorems that apply in various different cases. You can find a good summary of the various representation theorems in Kingman (1978) and Ressel (1985). The original version, due to de Finetti, established this correspondence only for binary sequences of values. This was later extended to the more general version that is the most commonly used (and corresponds to the version shown above) by Hewitt and Savage (1955). This latter representation theorem is sometimes called the de Finetti-Hewitt-Savage theorem, since it is their extension that gives the full power of the theorem. There is another useful extension by Diaconis and Freedman (1980) that establishes a representation theorem for cases of finite exchangeability --- roughly speaking, in this case the values are "almost IID" in the sense that there is a bounded difference between the actual probabilities and an IID approximation.
As the other answers on this thread point out, the IID condition has various advantages in terms of mathematical convenience and simplicity. While I do not see that as a justification of its realism, it is certainly an ancillary benefit of this model structure, and it speaks to the importance of the representation theorems. These theorems give an operational grounding for the IID model, and show that it is sufficient to assume exchangeability of an infinite sequence to obtain this model. Thus, in practice, if you want to know if a sequence of values is IID, all you need to do is ask yourself, "If I took any finite set of values from this sequence, would their probability measure change if I were to change the order of those values?" If the answer is no, then you have an exchangeable sequence, and hence, the IID condition is met.
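To make the picture concrete, here is a small simulation sketch (my own construction, not from the references above) of an exchangeable but unconditionally dependent sequence in the spirit of de Finetti's theorem: conditional on a latent success probability the tosses are IID, but marginally they are correlated, and permuting any finite set of them leaves their joint law unchanged.

```python
# Exchangeable-but-dependent sequence: draw a latent p from a Beta distribution,
# then generate coin tosses that are IID *given* p (the underlying F in the theorem).
import numpy as np

rng = np.random.default_rng(3)

def exchangeable_sequence(n, a=2.0, b=2.0):
    p = rng.beta(a, b)                  # the "underlying distribution" F
    return rng.binomial(1, p, size=n)   # IID F conditional on p

# Unconditional dependence: the first toss is informative about the second.
samples = np.array([exchangeable_sequence(2) for _ in range(100_000)])
print(np.corrcoef(samples[:, 0], samples[:, 1])[0, 1])   # clearly nonzero (about 0.2)
```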