References on data partitioning (cross-validation, train/val/test set construction) when data are non-IID

cross-validation, iid, predictive-models, references, spatial correlation

Consider a prediction setting in which we are interested in training a regression or classification function $f$ with inputs $X \in \mathbb{R}^k$ and target $Y$, and assessing its expected generalization performance on new data. For example, in a regression setting, we might want to estimate what our model's mean absolute error (MAE) or root mean square error (RMSE) will be on unseen data, and we typically do that by evaluating our model's predictions on a held-out test set.

Many standard textbooks (for example, Chapter 7, Model Assessment and Selection, of The Elements of Statistical Learning) discuss data partitioning and cross-validation in a context where we are dealing with i.i.d. samples from a joint distribution $F\left(X, Y\right)$. In this case, constructing the test set is straightforward: we take a simple random sample and call it our test set.
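For concreteness, the i.i.d. baseline might look something like the sketch below (synthetic data and scikit-learn; the ridge model is just a placeholder for whatever $f$ is being fit):

```python
# Baseline for i.i.d. data: a simple random train/test split,
# then MAE and RMSE on the held-out test set.
# Synthetic data; Ridge is a stand-in for any regression model f.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = X @ rng.normal(size=5) + rng.normal(scale=0.5, size=1000)

# With i.i.d. samples, a simple random sample is a valid test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = Ridge().fit(X_train, y_train)
pred = model.predict(X_test)
mae = mean_absolute_error(y_test, pred)
rmse = mean_squared_error(y_test, pred) ** 0.5
print(f"MAE = {mae:.3f}, RMSE = {rmse:.3f}")
```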

Are there any standard references that discuss how model assessment strategies should be modified when data are not i.i.d.? I've seen discussions that are specific to particular settings (for example, Data partitioning for spatial data is about spatial data), but I was wondering whether there is a general reference covering multiple settings.

Examples include:

  • Spatial data: the model may generalize more easily to points that are geographically close to the training set, and if we are interested in estimating the model's ability to generalize to points that are far from the training set, we'd need to account for that in our data partitioning / test set construction
  • Data with a natural discrete group or hierarchical structure: for example, if we are dealing with patients and hospitals, the question "how does my model generalize to new patients in a hospital that was included in the training set?" differs from "how does my model generalize to new hospitals?", and we should construct our test set(s) accordingly; we may even want to answer both questions, which we could do with two different test sets, one on held-out patients and another on held-out hospitals (see the sketch after this list)
  • Similarly, if dealing with panel data (such as observing individuals over time), the question "How does my model do when predicting on an individual who was already observed $K$ times?" might have a different answer than "How does my model do on its first prediction for an individual who was never observed before?"
  • Time series data: in most time series contexts (see https://stats.stackexchange.com/a/195438/9330 for an exception), we want to construct our test sets such that they cover a period that is entirely "in the future" relative to the training set
  • More complex real-world examples might involve multiple "complications" relative to the simple i.i.d. case: we might be dealing with spatiotemporal data, for example, and it might also have a hierarchical or group structure
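To make the group/hierarchical case concrete, here is a minimal sketch (scikit-learn splitters; the hospital IDs and features are made up) of how the two hospital questions above lead to two different test sets:

```python
# Two different generalization questions -> two different splits.
# Synthetic data; hospital_id is a hypothetical group label.
import numpy as np
from sklearn.model_selection import train_test_split, GroupShuffleSplit

rng = np.random.default_rng(0)
n = 2000
hospital_id = rng.integers(0, 20, size=n)   # 20 hypothetical hospitals
X = rng.normal(size=(n, 4))
y = X[:, 0] + 0.5 * hospital_id + rng.normal(size=n)

# Question 1: new patients at hospitals that appear in the training set.
# A simple random split answers this.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# Question 2: entirely new hospitals.
# Hold out whole hospitals so that no test hospital is seen during training.
gss = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(gss.split(X, y, groups=hospital_id))
assert set(hospital_id[train_idx]).isdisjoint(hospital_id[test_idx])
```

The same pattern carries over to the panel-data and spatial bullets: the group label becomes the individual, or a spatial block, respectively.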

Does any textbook or paper discuss cross-validation or test set construction in real-world problems, in a way that is general enough to cover all of the examples above (and possibly more)?

Best Answer

It really all boils down to two rules of thumb:

  1. When splitting your data, leave out what you want to predict. If you want to generalize to new hospitals, rather than new patients at the same hospital, leave out one hospital at a time when doing CV — do not leave out one patient at a time, as this only tests your ability to generalize to patients at the same hospital.
  2. When doing cross-validation, split your data into folds that can be considered approximately independent. For example, with time series data, you want to leave out a single contiguous run ("chunk") of observations at a time. If you have a time series running from 1900 to 2000 and want to use 10 folds, the first fold should be the first 10 years, the second the next 10, and so on. The idea is that even if the series isn't independent, 10 years is often long enough for most of the correlation to die out, especially if we're comparing models that are already reasonable at handling the time series structure of our data. If we instead assign each year to a random fold, a model can easily "cheat" by assuming that 2020 will look much the same as 2019 and 2021, whereas predicting 2020 from 2010 is genuinely hard. A correlogram can help you identify how long a lag is needed before you can consider each block "basically independent" of the others (see the sketch below).
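A minimal sketch of rule 2, assuming a yearly series from 1900 to 1999 and using scikit-learn's GroupKFold to keep each decade together as one fold (synthetic data; Ridge is just a placeholder model):

```python
# Blocked cross-validation for a yearly time series, 1900-1999.
# Each decade forms one fold, so test years are never interleaved
# with training years from the same decade.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GroupKFold, cross_val_score

rng = np.random.default_rng(0)
years = np.arange(1900, 2000)
X = rng.normal(size=(years.size, 3))
y = X[:, 0] + 0.1 * (years - 1950) + rng.normal(size=years.size)

decade = years // 10          # block label: 190, 191, ..., 199
cv = GroupKFold(n_splits=10)  # with 10 decades, each fold is one decade

scores = cross_val_score(
    Ridge(), X, y, groups=decade, cv=cv,
    scoring="neg_mean_absolute_error",
)
print(f"mean MAE across decade folds: {-scores.mean():.3f}")
```

For rule 1, the same idea with LeaveOneGroupOut (or GroupKFold) and the hospital ID as the group gives leave-one-hospital-out CV. If you additionally need every test period to lie strictly in the future relative to its training data, scikit-learn's TimeSeriesSplit is the forward-chaining alternative to these blocked folds.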

Some relevant papers:

https://onlinelibrary.wiley.com/doi/abs/10.1111/ecog.02881

https://www.sciencedirect.com/science/article/pii/S0020025511006773

https://www.tandfonline.com/doi/full/10.1080/00949655.2020.1783262

You can also check out the sperrorest R package for this.