Hypothesis Testing – Does the Assumption of ‘Independent and Identically Distributed’ Apply to Actual Sample Data or Sampling Process?

assumptions, hypothesis testing, iid, modeling, sampling

While studying statistics, I came across the concept of “independent and identically distributed random variables” or IID.

I’m confused as to what that applies to in practice, mainly, does it apply to the sampled data or the sampling process?

Let’s say I’m continually running a survey (via simple random sampling) asking users whether they like/dislike a service I provide. The data would not pass the IID assumption because: 1. The distribution of likes/dislikes will fluctuate over time based on service quality fluctuations (not identically distributed), and 2. The proportion of likes/dislikes will likely have trends and be autocorrelated (not independent).

So if IID applies to the sample data, it does not pass muster in this scenario; however, if applied to the sampling process, then it does pass, because using simple random sampling means each user had the same probability of being selected to take the survey (identically distributed) and one user being selected has no impact on future users being selected (independent).

Are my conclusions above correct? When checking the IID assumption in statistical modeling, am I checking it against the data distribution itself or the process that generated the data (sampling methodology)?

Best Answer

Your question is closely related to another question here asking about when it is realistic to assume that data are IID. Much of my present answer is adapted from my answer to the linked question.

As noted in my answer to the linked question, the operational meaning of the IID assumption is based on the condition of exchangeability via the "representation theorem" of Bruno de Finetti and others. Suppose you have an observable sequence $\mathbf{X}=(X_1,X_2,X_3,...)$ with empirical distribution $F_\mathbf{x}$. The representation theorem says that if the values in the sequence are exchangeable then you get the conditional IID result:

$$X_1,X_2,X_3, ... | F_\mathbf{x} \sim \text{IID } F_\mathbf{x}.$$

This means that the condition of exchangeability of an infinite sequence of values is the operational condition required for the values to be independent and identically distributed (conditional on some underlying distribution function). The theorem can be applied in both Bayesian and classical statistics (see O'Neill 2009 for further discussion), and in the latter case the underlying empirical distribution is treated as an "unknown constant" which effectively gives the corresponding marginal IID result. Note also that in parametric models, the distribution $F_\mathbf{x}$ is usually indexed by a small number of real parameters, which means that the observable data values are IID conditional on those parameters.
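For reference, the exchangeability condition used here is usually written as follows. This is the standard textbook definition rather than something taken from the linked answer, and the permutation symbol $\pi$ is notation introduced only for this statement:

$$(X_1, \ldots, X_n) \overset{d}{=} (X_{\pi(1)}, \ldots, X_{\pi(n)}) \quad \text{for every } n \text{ and every permutation } \pi \text{ of } \{1, \ldots, n\}.$$

In words, relabelling the positions of the observations leaves their joint distribution unchanged.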

As to whether this is an assumption that applies to the data or the sampling process, it is probably more accurate to say that it is an assumption about the sampling process that manifests in a particular type of behaviour for the data. Exchangeability of the observable sequence just means that the order of the data points doesn't matter, no matter how large the sample. So if you think that your sampling process is such that the order of the values does not give any information about them (probabilistically speaking) then you can assume that the observable sequence of data is exchangeable, and so the data is IID (conditional on the parameters of your model). Conversely, if you think that your sampling process is such that the order of the values does give some information about them (probabilistically speaking) then the data is not IID. It is also worth noting that exchangeability can be tested empirically using runs tests, so we are not entirely reliant on untested assumptions.
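As a rough illustration of the runs-test idea for like/dislike data, here is a minimal sketch of the Wald-Wolfowitz runs test for a 0/1 sequence, using the usual large-sample normal approximation. The function names `runs_test` and `simulate_survey`, and the drifting like-probability used to generate data, are assumptions made up for this example; they are not part of the original question or of any particular library.

```python
import math
import random


def runs_test(values):
    """Wald-Wolfowitz runs test for a sequence of 0/1 values.

    Returns (number of runs, z statistic, two-sided p-value), based on
    the normal approximation to the distribution of the run count.
    """
    n1 = sum(1 for v in values if v == 1)
    n2 = len(values) - n1
    if n1 == 0 or n2 == 0:
        raise ValueError("need at least one value of each kind")
    # A new run starts at every position where the value changes.
    runs = 1 + sum(a != b for a, b in zip(values, values[1:]))
    n = n1 + n2
    # Mean and variance of the run count if the order carries no information.
    mean = 2 * n1 * n2 / n + 1
    var = 2 * n1 * n2 * (2 * n1 * n2 - n) / (n ** 2 * (n - 1))
    if var <= 0:
        raise ValueError("sequence too short for the normal approximation")
    z = (runs - mean) / math.sqrt(var)
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return runs, z, p_value


def simulate_survey(n, p_start, p_end, seed=0):
    """Hypothetical survey: P(like) moves linearly from p_start to p_end."""
    rng = random.Random(seed)
    responses = []
    for t in range(n):
        p = p_start + (p_end - p_start) * t / (n - 1)
        responses.append(1 if rng.random() < p else 0)
    return responses


if __name__ == "__main__":
    stable = simulate_survey(2000, 0.5, 0.5)     # constant like-probability
    drifting = simulate_survey(2000, 0.1, 0.9)   # like-probability drifts
    print("stable  :", runs_test(stable))
    print("drifting:", runs_test(drifting))
```

A drifting process tends to produce fewer, longer runs than an exchangeable one, which shows up as a large negative z statistic. A non-significant result does not prove exchangeability; it only fails to detect this particular kind of departure.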

In the example you give in your question, you have an observable time-series and you are of the opinion that the order of the data matters, since the process may change over time. That means that you believe that exchangeability does not hold, so the data is not IID. In that particular case, you would probably want to use some kind of time-series model that allows for auto-correlation in the data, or changes in the process over time. So yes, you are broadly correct in your understanding of when it is and is not reasonable to assume that data is IID. For a deeper understanding, I recommend you read up on the representation theorem; you can also read O'Neill (2009) for some surrounding discussion of how this theorem applies in Bayesian and classical contexts.
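Before reaching for a time-series model, one simple way to check whether the order of the survey responses carries information is a permutation test on the lag-1 autocorrelation: shuffle the responses many times and see how often a shuffled sequence is at least as autocorrelated as the observed one. The sketch below is only an illustration under assumed data; the helper names `lag1_autocorr` and `permutation_p_value` and the two-regime survey are hypothetical.

```python
import random


def lag1_autocorr(values):
    """Sample lag-1 autocorrelation of a numeric sequence."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    denom = sum(d * d for d in dev)
    if denom == 0:
        raise ValueError("sequence is constant")
    return sum(a * b for a, b in zip(dev, dev[1:])) / denom


def permutation_p_value(values, n_perm=999, seed=0):
    """One-sided permutation test for positive lag-1 autocorrelation.

    Shuffling destroys any information in the ordering, so the shuffled
    statistics show what to expect if the sequence were exchangeable.
    """
    rng = random.Random(seed)
    observed = lag1_autocorr(values)
    shuffled = list(values)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if lag1_autocorr(shuffled) >= observed:
            hits += 1
    return observed, (1 + hits) / (1 + n_perm)


if __name__ == "__main__":
    # Hypothetical survey: service quality drops halfway through, so the
    # like-probability falls from 0.8 to 0.3 (made-up numbers).
    rng = random.Random(1)
    responses = [1 if rng.random() < 0.8 else 0 for _ in range(200)]
    responses += [1 if rng.random() < 0.3 else 0 for _ in range(200)]
    print(permutation_p_value(responses))
```

A small p-value suggests the ordering matters, which is the situation where a model allowing for autocorrelation or a changing process would be more appropriate than an IID model.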
