IID Sampling Test – How to Test for IID Sampling

Tags: hypothesis testing, iid, independence, kolmogorov-smirnov test, resampling

How would you test or check that sampling is IID (Independent and Identically Distributed)? Note that I do not mean Gaussian and Identically Distributed, just IID.

An idea that comes to mind is to repeatedly split the sample into two sub-samples of equal size, run the Kolmogorov-Smirnov test on each pair, and check that the resulting p-values are uniformly distributed.
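
A minimal sketch of the split-and-test idea, assuming the splits are random and using an asymptotic p-value formula (Kolmogorov series with Stephens' correction) in place of a library routine. The AR(1) input, the seed, and all function names are illustrative assumptions. One thing to watch for: a random split makes the two halves exchangeable by construction, so the p-values may look unremarkable even when the input is strongly dependent.

```python
import math
import random

def ks_statistic(a, b):
    """Two-sample KS statistic: largest gap between the two empirical CDFs."""
    a, b = sorted(a), sorted(b)
    i = j = 0
    d = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        while i < len(a) and a[i] <= x:
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        d = max(d, abs(i / len(a) - j / len(b)))
    return d

def ks_pvalue(d, n, m):
    """Asymptotic two-sided p-value for a two-sample KS statistic."""
    root_ne = math.sqrt(n * m / (n + m))
    lam = (root_ne + 0.12 + 0.11 / root_ne) * d
    if lam < 0.3:
        return 1.0  # the series converges slowly here and the p-value is close to 1
    total = 0.0
    for k in range(1, 101):
        total += (-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam)
    return min(1.0, max(0.0, 2.0 * total))

def split_pvalues(data, n_splits, rng):
    """Randomly split the data in two halves n_splits times; return the KS p-values."""
    half = len(data) // 2
    pvals = []
    for _ in range(n_splits):
        shuffled = data[:]
        rng.shuffle(shuffled)
        a, b = shuffled[:half], shuffled[half:2 * half]
        pvals.append(ks_pvalue(ks_statistic(a, b), len(a), len(b)))
    return pvals

rng = random.Random(0)

# Illustrative non-IID input: a strongly autocorrelated AR(1) series.
series = [0.0]
for _ in range(199):
    series.append(0.9 * series[-1] + rng.gauss(0.0, 1.0))

pvals = split_pvalues(series, 200, rng)
mean_p = sum(pvals) / len(pvals)
frac_below_005 = sum(p < 0.05 for p in pvals) / len(pvals)
print("mean p-value:", round(mean_p, 3))
print("fraction of p-values below 0.05:", round(frac_below_005, 3))
```

If the splits were instead made by collection order (first half against second half), the comparison could pick up drift over time, but that presumes the data has a meaningful order.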

Any comments on that approach, and any suggestions, are welcome.

Clarification after starting bounty:
I am looking for a general test that can be applied to non-time-series data.

Best Answer

Whether data is IID is something you conclude from outside information, not from the data itself. You, as the scientist, need to decide whether it is reasonable to assume the data is IID, based on how the data was collected and on other outside information.

Consider some examples.

Scenario 1: We generate a set of data independently from a single distribution that happens to be a mixture of 2 normals.

Scenario 2: We first generate a gender variable from a binomial distribution, then within males and females we independently generate data from a normal distribution (but the normals are different for males and females), then we delete or lose the gender information.

In scenario 1 the data is IID, and in scenario 2 the data is clearly not identically distributed (different distributions for males and females). But the data from the 2 scenarios are indistinguishable: you have to know how the data was generated to tell the difference.
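
A small simulation can make this concrete. The sketch below is only an illustration: the means, standard deviations, mixing proportion, seed, and function names are all made-up assumptions. Scenario 1 draws from the mixture by inverting its CDF, so no labels ever exist; scenario 2 draws a gender label first and then throws it away.

```python
import math
import random

# Illustrative (mean, sd) pairs and mixing proportion; not taken from any real data.
P_MALE, MALE, FEMALE = 0.5, (178.0, 7.0), (165.0, 6.0)

def normal_cdf(x, mu, sigma):
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def mixture_cdf(x):
    return P_MALE * normal_cdf(x, *MALE) + (1.0 - P_MALE) * normal_cdf(x, *FEMALE)

def scenario_1(n, rng):
    """Draw from the single mixture distribution by bisecting its CDF."""
    out = []
    for _ in range(n):
        u, lo, hi = rng.random(), 100.0, 250.0
        for _step in range(60):
            mid = 0.5 * (lo + hi)
            if mixture_cdf(mid) < u:
                lo = mid
            else:
                hi = mid
        out.append(0.5 * (lo + hi))
    return out

def scenario_2(n, rng):
    """Draw a gender label, then a value from that gender's normal; drop the labels."""
    genders = ["M" if rng.random() < P_MALE else "F" for _ in range(n)]
    return [rng.gauss(*(MALE if g == "M" else FEMALE)) for g in genders]

def ks_statistic(a, b):
    """Two-sample KS statistic: largest gap between the two empirical CDFs."""
    a, b = sorted(a), sorted(b)
    i = j = 0
    d = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        while i < len(a) and a[i] <= x:
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        d = max(d, abs(i / len(a) - j / len(b)))
    return d

rng = random.Random(0)
ks = ks_statistic(scenario_1(2000, rng), scenario_2(2000, rng))
print("KS distance between the two scenarios:", round(ks, 4))
```

The KS distance should be small, on the order of what two samples from the same distribution would give, which is the sense in which the data alone cannot separate the two scenarios.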

Scenario 3: I take a simple random sample of people living in my city, administer a survey, and analyze the results to make inferences about all people in the city.

Scenario 4: I take a simple random sample of people living in my city, administer a survey, and analyze the results to make inferences about all people in the country.

In scenario 3 the subjects would be considered independent (a simple random sample of the population of interest), but in scenario 4 they would not, because they were selected from a small subset of the population of interest and their geographic closeness would likely induce dependence. But the 2 datasets are identical; it is the way we intend to use the data that determines whether they are independent or dependent in this case.

So there is no way to show that data is IID using only the data. Plots and other diagnostics can reveal some types of non-IID behavior, but their absence does not guarantee that the data is IID. You can also test against more specific assumptions (IID normal is easier to disprove than just IID). Any test can still only rule things out: failing to reject never proves that the data is IID.
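
As one example of such a rule-out diagnostic, here is a rough sketch of a permutation test on the lag-1 autocorrelation. It assumes the values come with a meaningful collection order, which is itself outside information; the AR(1) input, the seed, and the function names are illustrative assumptions. The same values are tested twice, once in their original order and once after the order has been scrambled.

```python
import random

def lag1_autocorr(x):
    """Sample lag-1 autocorrelation of a sequence."""
    n = len(x)
    mean = sum(x) / n
    num = sum((x[i] - mean) * (x[i + 1] - mean) for i in range(n - 1))
    den = sum((v - mean) ** 2 for v in x)
    return num / den

def permutation_pvalue(x, n_perm, rng):
    """Two-sided permutation p-value for 'no lag-1 dependence in this ordering'."""
    observed = abs(lag1_autocorr(x))
    hits = 0
    for _ in range(n_perm):
        shuffled = x[:]
        rng.shuffle(shuffled)
        if abs(lag1_autocorr(shuffled)) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)

rng = random.Random(1)

# Illustrative dependent data: an AR(1) series in its collection order.
ordered = [0.0]
for _ in range(299):
    ordered.append(0.8 * ordered[-1] + rng.gauss(0.0, 1.0))

# Exactly the same values, but with the ordering information lost.
scrambled = ordered[:]
rng.shuffle(scrambled)

p_ordered = permutation_pvalue(ordered, 499, rng)
p_scrambled = permutation_pvalue(scrambled, 499, rng)
print("p-value, original order:", p_ordered)
print("p-value, order lost:", p_scrambled)
```

The test can flag the dependence when the order is available, but once the order is lost the values carry no trace of it, so a non-rejection says nothing about how the data was generated.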

Decisions about whether you are willing to assume that IID conditions hold need to be made based on the science of how the data was collected, how it relates to other information, and how it will be used.

Edits:

Here is another set of examples of non-identically distributed data.

Scenario 5: the data is residuals from a regression where there is heteroscedasticity (the variances are not equal).

Scenario 6: the data is from a mixture of normals with mean 0 but different variances.

In scenario 5 we can clearly see that the residuals are not identically distributed if we plot the residuals against fitted values or other variables (predictors, or potential predictors), but the residuals themselves (without the outside info) would be indistinguishable from scenario 6.
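
A rough simulation of the last two scenarios, under assumptions that are purely illustrative: the errors in scenario 5 are a stand-in for regression residuals, with a standard deviation that grows linearly with a hypothetical predictor x, and scenario 6 is a scale mixture that draws its standard deviation from the same range without any predictor. The seed and the function names are also made up.

```python
import math
import random

def ks_statistic(a, b):
    """Two-sample KS statistic: largest gap between the two empirical CDFs."""
    a, b = sorted(a), sorted(b)
    i = j = 0
    d = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        while i < len(a) and a[i] <= x:
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        d = max(d, abs(i / len(a) - j / len(b)))
    return d

def pearson(x, y):
    """Pearson correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

rng = random.Random(2)
n = 2000

# Scenario 5: the spread depends on a predictor x that we get to keep.
x = [rng.random() for _ in range(n)]
resid_5 = [rng.gauss(0.0, 0.5 + 2.0 * xi) for xi in x]

# Scenario 6: mean-0 normals with varying spread, and no predictor at all.
resid_6 = [rng.gauss(0.0, 0.5 + 2.0 * rng.random()) for _ in range(n)]

corr_with_x = pearson(x, [abs(e) for e in resid_5])
ks = ks_statistic(resid_5, resid_6)
print("correlation of |residual| with x (scenario 5):", round(corr_with_x, 3))
print("KS distance between scenario 5 and 6 values:", round(ks, 4))
```

With x in hand the changing spread shows up as a clear positive correlation, while the bare values from the two scenarios should be about as close as two samples from one distribution.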