Solved – Data partitioning for spatial data

geostatistics, machine learning, partitioning, random forest, spatial

I am constructing different configurations of a Random Forest to investigate the influence of well-design variables and location on the first-year production volumes of shale oil wells within a given area in the US. In the different model configurations, I control for location in different ways to show how the influence of well-design variables may be biased when the spatial resolution of the models is inadequate. Here, location acts as a proxy for geological properties/reservoir quality.

I have a dataset of ~4500 wells, with 6 variables. The response is the first-year production volume, and the predictors are three different well-design variables in addition to longitude and latitude.
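For context, here is a minimal sketch of two such configurations, assuming hypothetical column names (prod_1yr, lateral_length, proppant, fluid, lon, lat) and the ranger package for the Random Forest (my actual setup may differ):

```r
library(ranger)

# wells <- read.csv("wells.csv")  # ~4500 rows, one per well (hypothetical file)

# Configuration A: well-design variables only (location ignored)
rf_design <- ranger(
  prod_1yr ~ lateral_length + proppant + fluid,
  data = wells, importance = "permutation"
)

# Configuration B: well-design variables plus location as a proxy for geology
rf_design_loc <- ranger(
  prod_1yr ~ lateral_length + proppant + fluid + lon + lat,
  data = wells, importance = "permutation"
)

# Compare how the importance of the design variables shifts once location is included
rf_design$variable.importance
rf_design_loc$variable.importance
```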

I have been researching and putting some thought into the subject of data partitioning when working with spatial data. For instance, in this chapter of "Geocomputation with R" by Lovelace et al. (https://geocompr.robinlovelace.net/spatial-cv.html), they highlight the importance of spatial cross-validation: "Randomly splitting spatial data can lead to training points that are neighbors in space with test points. Due to spatial autocorrelation, test and training datasets would not be independent in this scenario, with the consequence that CV fails to detect possible overfitting. Spatial CV alleviates this problem and is the central theme of this chapter."

Further, they illustrate how a spatial partitioning may differ from a random partitioning:
[Figure: spatially vs. randomly partitioned observations]

…and show an example of how results may be optimistically biased if spatial data is split at random (here, the difference in AUC for a classification problem):
[Figure: AUC under different CV schemes]

The point is that due to spatial autocorrelation (near things are more related than distant things), you will end up with some observations in the training set that are very similar to observations in the test set if the proximity of observations is not accounted for when splitting the data. This may cause "information leakage" between the sets.
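To make the distinction concrete, here is a minimal sketch in base R (reusing the hypothetical wells data frame from above) contrasting random fold assignment with spatially clustered folds built by k-means on the coordinates, which is the idea behind coordinate-based spatial CV:

```r
set.seed(42)
k <- 5

# Random CV: fold membership ignores where the wells are
random_fold <- sample(rep(1:k, length.out = nrow(wells)))

# Spatial CV: wells that are close together end up in the same fold,
# so each test fold is spatially disjoint from its training folds
coords <- wells[, c("lon", "lat")]
spatial_fold <- kmeans(coords, centers = k)$cluster

# With random folds, a test well's nearest neighbour is almost always in the
# training set; with spatial folds it usually is not.
table(random_fold)
table(spatial_fold)
```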

My question is: does this information leakage necessarily pose a problem? I figure that this leakage and the similarity of observations may just as well be representative of the problem at hand, and therefore make the performance assessment more representative of a real-life application of the model. I understand that a spatially disjoint test set yields a more representative performance assessment if the model is to be used for predicting in a completely new and distant area. But if you want to assess a model's predictive performance with respect to a mix of near and distant locations, wouldn't a random split be more reasonable?

Hoping for some input here, thanks!

Edit: After reaching out to the authors of the above-mentioned book on Twitter, I was advised to check out the following lecture by Hanna Meyer: https://www.youtube.com/watch?v=mkHlmYEzsVQ. She makes a distinction between "data reproduction" and "data prediction" (at approximately 16:40 in the video). This is something that crossed my mind while initially writing this post: that I am not really applying these models for prediction, but rather using predictive models as a tool for investigating factors that influence well productivity. After watching the video, I have become more confident that this application is more like "data reproduction", where a random partitioning seems OK, rather than "data prediction".

Best Answer

"After watching the video, I have become more confident that this application is more like 'data reproduction', where a random partitioning is OK, rather than 'data prediction'."

To me, you lean too heavily on "if I use it for data reproduction, it is OK" to justify using random CV for spatial ML models.

  • The overarching goal of fitting a model is to make predictions on unseen data.
  • The overarching goal of a performance assessment of a model is to get a generalizable overview of its performance across all kinds of prediction scenarios. Because there is only a limited amount of data available, one tries to make the best of it by estimating the performance via CV.

Prediction scenarios in spatial modeling always come with the fact that the prediction set is spatially distant from the training data (with varying degrees of distance). And when this happens, you are just lying to yourself with a model performance estimated via non-spatial (NSP) CV.

If you justify your choice of a random CV in a spatial model scenario with "I just want to reproduce the data", then you do not need to conduct a CV at all / estimate the performance. In this case, you can simply fit the model and try to interpret it (if this is possible).

Random partitioning of spatial data for CV is sometimes very similar to training and testing your model on the same data (i.e. train on "train" and test on "train"), because both sets are highly similar due to the chosen partitioning. Remember that the partitioning was freely chosen by you here, most likely because it is the most popular scheme in ML. While it is perfectly fine for non-spatial data (because there is no such thing as spatial relations there), it is in fact completely unsuited for spatial data. There is no problem in reporting a less impressive performance of a model. And due to the availability of dedicated partitioning strategies for spatial data (at least in R, see 1, 2, 3) there is also no excuse that no method was available to easily apply one (even though the non-availability of implementations in a programming language should never be a justification for using invalid ones).
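As an illustration, here is a sketch of one such dedicated strategy using the mlr3 ecosystem (mlr3, mlr3learners, mlr3spatiotempcv) and the hypothetical wells data frame / column names from the question; the exact argument names are my assumption and may vary between package versions:

```r
library(mlr3)
library(mlr3learners)       # provides the regr.ranger learner
library(mlr3spatiotempcv)   # provides spatial resampling strategies

# Build a spatial regression task from the wells data frame (hypothetical columns)
task <- as_task_regr_st(
  wells,
  target = "prod_1yr",
  coordinate_names = c("lon", "lat"),
  crs = "EPSG:4326"
)

learner <- lrn("regr.ranger")

# Coordinate-based k-means partitioning ("spcv_coords") vs. plain random CV
rr_spatial <- resample(task, learner, rsmp("spcv_coords", folds = 5))
rr_random  <- resample(task, learner, rsmp("cv", folds = 5))

# The spatial estimate is typically the more pessimistic (and more honest) one
rr_spatial$aggregate(msr("regr.rmse"))
rr_random$aggregate(msr("regr.rmse"))
```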

I cannot think of a valid reason why one would even start trying to justify the correctness of non-spatial partitioning methods in certain scenarios with the potential goal of reporting the results in a scientific study/report. Trying to do so feels like "cheating on purpose" to me, for the sake of better performance values. We still have a long way to go in making the spatial modeling community aware of the need to use spatial partitioning methods for CV. I want to ensure that people coming across this question understand that using random CV is not a valid choice under these circumstances and that its results are completely invalid for any kind of reporting.

The same applies to optimizing hyperparameters: if you do this with CV and use random partitioning, you are in danger of selecting non-optimal parameters for the performance estimation in the outer loop. While the differences between spatial and non-spatial CV for tuning might not be as large as those for the performance estimation, they still introduce another potential bias into your study design (Schratz et al. 2019). A minimal sketch of such a nested setup follows.
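This is a base-R sketch of nested CV with spatial folds in both loops, tuning only mtry for illustration (again assuming the hypothetical wells data frame and column names from the question):

```r
library(ranger)
set.seed(42)

design_vars <- c("lateral_length", "proppant", "fluid", "lon", "lat")
outer_fold  <- kmeans(wells[, c("lon", "lat")], centers = 5)$cluster

outer_rmse <- sapply(1:5, function(i) {
  train <- wells[outer_fold != i, ]
  test  <- wells[outer_fold == i, ]

  # Inner loop: spatial folds built *within* the outer training data
  inner_fold <- kmeans(train[, c("lon", "lat")], centers = 4)$cluster
  mtry_grid  <- 1:4
  inner_rmse <- sapply(mtry_grid, function(m) {
    mean(sapply(1:4, function(j) {
      fit  <- ranger(prod_1yr ~ .,
                     data = train[inner_fold != j, c("prod_1yr", design_vars)],
                     mtry = m)
      pred <- predict(fit, train[inner_fold == j, ])$predictions
      sqrt(mean((pred - train$prod_1yr[inner_fold == j])^2))
    }))
  })

  # Refit on the full outer training fold with the tuned mtry, then test spatially
  best_mtry <- mtry_grid[which.min(inner_rmse)]
  fit  <- ranger(prod_1yr ~ ., data = train[, c("prod_1yr", design_vars)],
                 mtry = best_mtry)
  pred <- predict(fit, test)$predictions
  sqrt(mean((pred - test$prod_1yr)^2))
})

mean(outer_rmse)  # performance estimate with spatially tuned hyperparameters
```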
