i.i.d. Assumption – Importance in Statistical Learning

Tags: cross-validation, iid, machine-learning, non-independent

In statistical learning, implicitly or explicitly, one always assumes that the training set $\mathcal{D} = \{ {\bf{X}}, {\bf{y}} \}$ is composed of $N$ input/response tuples $({\bf{X}}_i,y_i)$ that are independently drawn from the same joint distribution $\mathbb{P}({\bf{X}},y)$ with

$$ p({\bf{X}},y) = p( y \vert {\bf{X}}) p({\bf{X}}) $$

where $p( y \vert {\bf{X}})$ is the relationship we are trying to capture through a particular learning algorithm. Mathematically, the i.i.d. assumption can be written as:

\begin{gather}
({\bf{X}}_i,y_i) \sim \mathbb{P}({\bf{X}},y), \quad \forall i=1,\ldots,N \\
({\bf{X}}_i,y_i) \text{ independent of } ({\bf{X}}_j,y_j), \quad \forall i \ne j \in \{1,\ldots,N\}
\end{gather}

I think we can all agree that this assumption is rarely satisfied in practice; see this related SE question and the wise comments of @Glen_b and @Luca.

My question is therefore:

Where exactly does the i.i.d. assumption become critical in practice?

[Context]

I'm asking this because I can think of many situations where such a stringent assumption is not needed to train a certain model (e.g. linear regression methods), or where one can at least work around the i.i.d. assumption and obtain robust results. In fact, the fitted results usually stay the same; it is rather the inferences one can draw from them that change (e.g. heteroskedasticity- and autocorrelation-consistent (HAC) estimators in linear regression: the idea is to re-use the good old OLS regression weights but to adapt the finite-sample behaviour of the OLS estimator to account for the violation of the Gauss-Markov assumptions).
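To make the HAC remark concrete, here is a minimal sketch, assuming the Python statsmodels library and a synthetic AR(1) error process of my own choosing (neither is part of the original argument): the point estimates are the plain OLS coefficients, only the standard errors change.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
N = 500
x = rng.normal(size=N)

# Synthetic AR(1) errors violate the independence part of the Gauss-Markov assumptions.
eps = np.zeros(N)
for t in range(1, N):
    eps[t] = 0.7 * eps[t - 1] + rng.normal()
y = 1.0 + 2.0 * x + eps

X = sm.add_constant(x)
model = sm.OLS(y, X)

fit_iid = model.fit()                                          # classical OLS standard errors
fit_hac = model.fit(cov_type="HAC", cov_kwds={"maxlags": 10})  # Newey-West / HAC standard errors

# Same coefficients, different standard errors.
print(fit_iid.params, fit_iid.bse)
print(fit_hac.params, fit_hac.bse)
```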

My guess is therefore that the i.i.d. assumption is required not to train a particular learning algorithm, but rather to guarantee that techniques such as cross-validation can be used to infer a reliable measure of the model's ability to generalise well, which is ultimately the only thing we care about in statistical learning because it shows that we can indeed learn from the data. Intuitively, I can understand that using cross-validation on dependent data could be optimistically biased (as illustrated/explained in this interesting example).
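To make that intuition tangible, here is a rough sketch, assuming scikit-learn and a data-generating process of my own choosing (neither appears in the post): two independent random walks have no real relationship, yet shuffled K-fold scores tend to look far better than scores from splits that respect the time order.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, TimeSeriesSplit, cross_val_score

rng = np.random.default_rng(0)
N = 400

# Two *independent* random walks: any apparent relationship between X and y is spurious.
y = np.cumsum(rng.normal(size=N))
X = np.cumsum(rng.normal(size=(N, 3)), axis=0)

model = LinearRegression()

shuffled = cross_val_score(model, X, y, cv=KFold(n_splits=5, shuffle=True, random_state=0))
ordered = cross_val_score(model, X, y, cv=TimeSeriesSplit(n_splits=5))

# Shuffled folds interpolate within the same spurious trend and tend to look
# much better than folds that respect the time order.
print("shuffled K-fold R^2:", shuffled.mean())
print("time-ordered R^2:   ", ordered.mean())
```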

For me, i.i.d. thus has nothing to do with training a particular model and everything to do with that model's generalisability. This seems to agree with a paper I found by Huan Xu et al., see "Robustness and Generalizability for Markovian Samples" here.

Would you agree with that?

[Example]

If this can help the discussion, consider the problem of using the LASSO algorithm to perform a smart selection amongst $P$ features, given $N$ training samples $({\bf{X}}_i,y_i)$, $i=1,\ldots,N$, with
$$ {\bf{X}}_i=[X_{i1},\ldots,X_{iP}] $$
We can further assume that:

  • The inputs ${\bf{X}}_i$ are dependent, hence violating the i.i.d. assumption (e.g. for each feature $j=1,\ldots,P$ we observe an $N$-point time series, which introduces temporal autocorrelation).
  • The conditional responses $y_i \vert {\bf{X}}_i$ are independent.
  • We have $P \gg N$.

In what way(s) can the violation of the i.i.d. assumption pose a problem in that case, assuming we plan to determine the LASSO penalisation coefficient $\lambda$ using a cross-validation approach (on the full data set) and to use a nested cross-validation to get a feel for the generalisation error of this learning strategy? (We can leave the discussion of the inherent pros/cons of the LASSO aside, unless it is useful.)
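For concreteness, here is one way the strategy described above could be written down; this is only a sketch, assuming scikit-learn, synthetic data and plain K-fold splitters, all of which are my own choices rather than part of the question. The inner loop selects $\lambda$, the outer loop estimates the generalisation error, and under temporal dependence both splitters would be replaced by dependence-aware ones.

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(0)
N, P = 50, 500                     # P >> N, as in the example

X = rng.normal(size=(N, P))
beta = np.zeros(P)
beta[:5] = 3.0                     # only a handful of truly active features
y = X @ beta + rng.normal(size=N)

inner_cv = KFold(n_splits=5, shuffle=True, random_state=0)   # selects lambda
outer_cv = KFold(n_splits=5, shuffle=True, random_state=1)   # estimates generalisation error

lasso = LassoCV(cv=inner_cv, max_iter=10_000)
nested_scores = cross_val_score(lasso, X, y, cv=outer_cv)

# With temporally dependent rows, both KFold objects would be swapped for
# dependence-aware splitters (e.g. TimeSeriesSplit or blocked folds).
print(nested_scores.mean())
```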

Best Answer

The i.i.d. assumption about the pairs $(\mathbf{X}_i, y_i)$, $i = 1, \ldots, N$, is often made in statistics and in machine learning. Sometimes for a good reason, sometimes out of convenience and sometimes just because we usually make this assumption. To satisfactorily answer whether the assumption is really necessary, and what the consequences are of not making it, I would easily end up writing a book (if you ever easily end up doing something like that). Here I will try to give a brief overview of what I find to be the most important aspects.

A fundamental assumption

Let's assume that we want to learn a probability model of $y$ given $\mathbf{X}$, which we call $p(y \mid \mathbf{X})$. We do not make any assumptions about this model a priori, but we will make the minimal assumption that such a model exists such that

  • the conditional distribution of $y_i$ given $\mathbf{X}_i$ is $p(y_i \mid \mathbf{X}_i)$.

What is worth noting about this assumption is that the conditional distribution of $y_i$ depends on $i$ only through $\mathbf{X}_i$. This is what makes the model useful, e.g. for prediction. The assumption holds as a consequence of the identically distributed part under the i.i.d. assumption, but it is weaker because we don't make any assumptions about the $\mathbf{X}_i$'s.

In the following the focus will mostly be on the role of independence.

Modelling

There are two major approaches to learning a model of $y$ given $\mathbf{X}$. One approach is known as discriminative modelling and the other as generative modelling.

  • Discriminative modelling: We model $p(y \mid \mathbf{X})$ directly, e.g. a logistic regression model, a neural network, a tree or a random forest. The working modelling assumption will typically be that the $y_i$'s are conditionally independent given the $\mathbf{X}_i$'s, though estimation techniques relying on subsampling or bootstrapping make most sense under the i.i.d. or the weaker exchangeability assumption (see below). But generally, for discriminative modelling we don't need to make distributional assumptions about the $\mathbf{X}_i$'s.
  • Generative modelling: We model the joint distribution, $p(\mathbf{X}, y)$, of $(\mathbf{X}, y)$ typically by modelling the conditional distribution $p(\mathbf{X} \mid y)$ and the marginal distribution $p(y)$. Then we use Bayes's formula for computing $p(y \mid \mathbf{X})$. Linear discriminant analysis and naive Bayes methods are examples. The working modelling assumption will typically be the i.i.d. assumption.
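As a concrete contrast between the two approaches, here is a minimal sketch, assuming scikit-learn and a toy Gaussian class-conditional data-generating process (my own illustration, not the answer's): logistic regression models $p(y \mid \mathbf{X})$ directly, while naive Bayes models $p(\mathbf{X} \mid y)$ and $p(y)$ and then applies Bayes's formula; both end up producing estimates of $p(y \mid \mathbf{X})$.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression  # discriminative: models p(y | X) directly
from sklearn.naive_bayes import GaussianNB            # generative: models p(X | y) and p(y)
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
N = 1000
y = rng.integers(0, 2, size=N)
X = rng.normal(loc=y[:, None], scale=1.0, size=(N, 2))  # class-dependent Gaussian features

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

disc = LogisticRegression().fit(X_tr, y_tr)
gen = GaussianNB().fit(X_tr, y_tr)

# Both yield estimates of p(y | X); they just get there by different routes.
print(disc.predict_proba(X_te[:3]))
print(gen.predict_proba(X_te[:3]))
```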

For both modelling approaches the working modelling assumption is used to derive or propose learning methods (or estimators). That could be by maximising the (penalised) log-likelihood, minimising the empirical risk or by using Bayesian methods. Even if the working modelling assumption is wrong, the resulting method can still provide a sensible fit of $p(y \mid \mathbf{X})$.

Some techniques used together with discriminative modelling, such as bagging (bootstrap aggregation), work by fitting many models to data sampled randomly from the dataset. Without the i.i.d. assumption (or exchangeability) the resampled datasets will not have a joint distribution similar to that of the original dataset. Any dependence structure gets "messed up" by the resampling. I have not thought deeply about this, but I don't see why that should necessarily break the method as a method for learning $p(y \mid \mathbf{X})$. At least not for methods based on the working independence assumptions. I am happy to be proved wrong here.
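A bare-bones sketch of what bagging does (a manual bootstrap of my own, not tied to any particular library implementation): each resampled dataset is drawn row by row with replacement, which is exactly the step that ignores, and thereby scrambles, any dependence between rows.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
N = 300
X = rng.normal(size=(N, 5))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(size=N)

def bagged_predict(X_train, y_train, X_new, n_models=50):
    """Average the predictions of trees fitted on bootstrap resamples of the rows."""
    preds = []
    for _ in range(n_models):
        idx = rng.integers(0, len(y_train), size=len(y_train))  # rows drawn with replacement
        tree = DecisionTreeRegressor().fit(X_train[idx], y_train[idx])
        preds.append(tree.predict(X_new))
    return np.mean(preds, axis=0)

print(bagged_predict(X, y, X[:3]))
```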

Consistency and error bounds

A central question for all learning methods is whether they result in models close to $p(y \mid \mathbf{X})$. There is a vast theoretical literature in statistics and machine learning dealing with consistency and error bounds. A main goal of this literature is to prove that the learned model is close to $p(y \mid \mathbf{X})$ when $N$ is large. Consistency is a qualitative assurance, while error bounds provide (semi-) explicit quantitative control of the closeness and give rates of convergence.

The theoretical results all rely on assumptions about the joint distribution of the observations in the dataset. Often the working modelling assumptions mentioned above are made (that is, conditional independence for discriminative modelling and i.i.d. for generative modelling). For discriminative modelling, consistency and error bounds will require that the $\mathbf{X}_i$'s fulfil certain conditions. In classical regression one such condition is that $\frac{1}{N} \mathbb{X}^T \mathbb{X} \to \Sigma$ for $N \to \infty$, where $\mathbb{X}$ denotes the design matrix with rows $\mathbf{X}_i^T$. Weaker conditions may be enough for consistency. In sparse learning another such condition is the restricted eigenvalue condition, see e.g. "On the conditions used to prove oracle results for the Lasso". The i.i.d. assumption together with some technical distributional assumptions implies that some such sufficient conditions are fulfilled with large probability, and thus the i.i.d. assumption may prove to be a sufficient but not a necessary assumption to get consistency and error bounds for discriminative modelling.
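To illustrate the classical design condition mentioned above, here is a purely numerical sketch of my own (not from the answer): for i.i.d. rows with finite second moments, the empirical matrix $\frac{1}{N} \mathbb{X}^T \mathbb{X}$ settles down to a fixed $\Sigma$ as $N$ grows, by the law of large numbers.

```python
import numpy as np

rng = np.random.default_rng(0)
Sigma = np.array([[1.0, 0.5],
                  [0.5, 2.0]])
L = np.linalg.cholesky(Sigma)

# Rows X_i = L z_i with z_i standard normal have second-moment matrix Sigma.
for N in (100, 1_000, 10_000, 100_000):
    X = rng.normal(size=(N, 2)) @ L.T
    print(N, np.round(X.T @ X / N, 3))  # should approach Sigma as N grows
```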

The working modelling assumption of independence may be wrong for either of the modelling approaches. As a rough rule-of-thumb one can still expect consistency if the data comes from an ergodic process, and one can still expect some error bounds if the process is sufficiently fast mixing. A precise mathematical definition of these concepts would take us too far away from the main question. It is enough to note that there exist dependence structures besides the i.i.d. assumption for which the learning methods can be proved to work as $N$ tends to infinity.

If we have more detailed knowledge about the dependence structure, we may choose to replace the working independence assumption used for modelling with a model that captures the dependence structure as well. This is often done for time series. A better working model may result in a more efficient method.
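As one concrete instance of swapping the working independence assumption for a model of the dependence (statsmodels and an AR(1) working model are my own choices for illustration): with serially correlated errors, a feasible GLS fit that models the AR(1) structure can be more efficient than plain OLS.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
N = 500
x = rng.normal(size=N)

eps = np.zeros(N)
for t in range(1, N):
    eps[t] = 0.8 * eps[t - 1] + rng.normal()  # AR(1) errors: the dependence we choose to model
y = 1.0 + 2.0 * x + eps

X = sm.add_constant(x)
ols = sm.OLS(y, X).fit()
gls = sm.GLSAR(y, X, rho=1).iterative_fit(maxiter=10)  # feasible GLS with an AR(1) working model

print(ols.params, ols.bse)
print(gls.params, gls.bse)
```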

Model assessment

Rather than proving that the learning method gives a model close to $p(y \mid \mathbf{X})$ it is of great practical value to obtain a (relative) assessment of "how good a learned model is". Such assessment scores are comparable for two or more learned models, but they will not provide an absolute assessment of how close a learned model is to $p(y \mid \mathbf{X})$. Estimates of assessment scores are typically computed empirically based on splitting the dataset into a training and a test dataset or by using cross-validation.

As with bagging, a random splitting of the dataset will "mess up" any dependence structure. However, for methods based on the working independence assumptions, ergodicity assumptions weaker than i.i.d. should be sufficient for the assessment estimates to be reasonable, though standard errors on these estimates will be very difficult to come up with.

[Edit: Dependence among the observations will result in a distribution of the learned model that differs from the distribution under the i.i.d. assumption. The estimate produced by cross-validation is then not obviously related to the generalization error. If the dependence is strong, it will most likely be a poor estimate.]

Summary (tl;dr)

All the above is under the assumption that there is a fixed conditional probability model, $p(y \mid \mathbf{X})$. Thus there cannot be trends or sudden changes in the conditional distribution not captured by $\mathbf{X}$.

When learning a model of $y$ given $\mathbf{X}$, independence plays a role as

  • a useful working modelling assumption that allows us to derive learning methods
  • a sufficient but not necessary assumption for proving consistency and providing error bounds
  • a sufficient but not necessary assumption for using random data splitting techniques such as bagging for learning and cross-validation for assessment.

To understand precisely which alternatives to i.i.d. are also sufficient is non-trivial and to some extent a research subject.