If I understand you correctly, you want to draw a distance-constrained random sample from your data for each observation in the data. This is akin to a k nearest neighbor (kNN) analysis.
Here is an example workflow that creates a kNN random sample under a minimum-distance constraint and adds the corresponding rowname back to your data.
Add libraries and example data
library(sp)
data(meuse)
coordinates(meuse) <- ~x+y
Calculate a distance matrix using spDists
dmat <- spDists(meuse)
Define the minimum sample distance and set distances at or below it to NA in the distance matrix. This is where you would impose any type of constraint, say a distance range.
min.dist <- 500
dmat[dmat <= min.dist] <- NA
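For instance, a distance-range (ring) constraint can be imposed the same way by masking both ends of the band. A minimal base-R sketch on a toy distance matrix (the min.dist/max.dist values are made up for illustration):

```r
# Toy symmetric distance matrix for 4 points on a line at 0, 300, 800, 1600
xy <- c(0, 300, 800, 1600)
dmat <- abs(outer(xy, xy, "-"))

# Keep only candidate neighbors within a distance band
min.dist <- 500
max.dist <- 1500
dmat[dmat <= min.dist | dmat > max.dist] <- NA

dmat[, 1]  # for point 1, only the point at distance 800 survives the band
```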
Here we iterate through each row of the distance matrix and draw one random sample from the non-NA entries. The "samples" object is a data.frame where ID holds the rownames of the source object and kNN holds the rowname of the selected neighbor. Note: some NA handling is included in case no neighbor is found, which can happen with distance constraints.
samples <- data.frame(ID = rownames(meuse@data), kNN = NA)
for(i in 1:nrow(dmat)) {
  x <- as.vector(dmat[, i])   # distances from observation i (matrix is symmetric)
  names(x) <- samples$ID
  x <- x[!is.na(x)]           # drop self and neighbors excluded by the constraint
  if(length(x) > 0) {
    samples[i, "kNN"] <- names(x)[sample(1:length(x), 1)]  # draw one neighbor at random
  } else {
    samples[i, "kNN"] <- NA   # no neighbor satisfies the constraint
  }
}
We can then add the kNN column, containing the rownames of the nearest neighbor, to the original data.
meuse@data <- data.frame(meuse@data, kNN=samples$kNN)
head(meuse@data)
We could also subset the data to the unique nearest-neighbor observations.
meuse.sub <- meuse[which(rownames(meuse@data) %in% unique(samples$kNN)),]
There are more elegant ways to perform this analysis, but this workflow gets the general idea across. I would recommend taking a hard look at the spdep library and its dnearneigh or knearneigh functions for a more advanced solution.
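To give a rough idea of what dnearneigh computes, here is a base-R sketch that builds a list of neighbor indices within a distance band for each point. This is only a crude stand-in: spdep::dnearneigh is faster and returns a proper "nb" object, and the exact bound convention (open vs. closed) used here is an assumption for illustration.

```r
# Neighbor indices within (d1, d2] for each point; a crude stand-in
# for the neighbor list that spdep::dnearneigh would return
xy <- cbind(x = c(0, 100, 400, 900), y = c(0, 0, 0, 0))
dmat <- as.matrix(dist(xy))

d1 <- 50; d2 <- 500
nb <- lapply(seq_len(nrow(dmat)), function(i) {
  which(dmat[i, ] > d1 & dmat[i, ] <= d2)
})
nb[[1]]  # neighbors of point 1 (points at distances 100 and 400)
```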
What I do:
Create attribute for raster (Build Raster Attribute Table in ArcGIS)
Select a class/row in raster attribute table
Use raster to point tool to create a BUNCH of points for each pixel of that value
Use Subset Features Tool to create specified number of random points
Note: this is the technique I used for accuracy assessments; I compare the points against the imagery to see how valid the classification was.
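The final subsetting step can be mimicked in R once the pixel points exist as a data.frame. The column names and sample size below are hypothetical, purely to show the select-then-subsample pattern:

```r
# Random accuracy-assessment sample: keep n points of a given class
set.seed(42)
pts <- data.frame(class = rep(c("forest", "water"), each = 50),
                  x = runif(100), y = runif(100))

n <- 10
forest.pts <- pts[pts$class == "forest", ]           # select one class
aa.sample  <- forest.pts[sample(nrow(forest.pts), n), ]  # random subset of size n
nrow(aa.sample)
```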
Well, if it were as simple as an out-of-the-box sample function with a distance argument, I would have provided that as a solution (although now, I may write one). Depending on your sample, you can actually add bias to the resulting subsampled data by adding an explicit distance criterion. There are also cases, with highly clustered data juxtaposed with randomly distributed observations, where you may weaken the autocorrelation but, functionally, not remove it. I know that distance-based subsampling is common practice in RSF models, but it is quite arbitrary in both application and legacy.
One thing I would recommend trying, to reduce autocorrelation or pseudoreplication (in this case interchangeable terms), is to subsample the data based on the observed spatial process itself. The function "pp.subsample" in the spatialEco package will create a subsample based on the expected spatial intensity function of the observed data, thus reducing clustering. A quick look at the data will indicate whether you want a 1st- or 2nd-order bandwidth. If there is significant localized clustering, then I would recommend sigma = "diggle" or "stoyan". In contrast, if there is weak clustering over large distance lags, then a 1st-order bandwidth, like "scott", is in order.
The size (n) of the subsample is slightly more complicated, and you may need to perform a power test to find the trade-off between a degree of weak autocorrelation, which will likely not affect the iid assumption or residual error in an OLS or GLM, and statistical power given the sample size.