[GIS] creating clusters of point data

clustering

I have a map of stores for a retailer. I am trying to form clusters (of these stores) such that each member of a cluster is at most 5km away from any other member and any non-member is at least 10 km from any member. I also would like to create a field in the point dataset indicating to which cluster a specific store beongs to. What would be the easiest way to do this?
Thanks

Best Answer

As @whuber says, I think you're going to need to relax your criteria, because it won't work as stands. One possible algorithm would be:

for each point calculate the distance to the nearest other point - in Mapinfo this can be done using the Distance Calculator tool, or in MapBasic
arrange these distances in descending order
for the smallest nearest distance, select all points that are within 5km (or whatever parameter you choose) of that point -
draw a convex hull around these points (a shape that encloses all points)
remove those points from the list and continue down the descending list of nearest distances, only selecting those points which have not already been made into a convex hull

There are automated routines to do this sort of analysis in packages like ArcGIS or R, but processing time will be significant if you have a large number of points. You will probably have to experiment to get the best results, I don't think it's possible to lay down hard rules about clusters as you have above.

Related Solutions

[GIS] Removing Outliers from point layer

This is actually a data science question vs. one directly related to GIS. And it's a concept that is way more complicated than it initially seems. If you're interested in ways of categorizing/grouping that data, you'll want to take a look at data clustering (2) and machine learning. The stats section of Stack Exchange would be a good resource to start with.

Clustering – Identify Spatially Contiguous Clusters in Raster Data Using KMeans

Based on the discussion above, I further explored the respective literature and found a very suitable algorithm addressing my question. The method is capable of defining neighborhood through rows and columns in the grid rather than distances between pixel centroids. I can directly choose between Queen's and Rook's case contiguity. Distances do not have to be included as explanatory variables and then weighted until regions are contiguous. The algorithm restricts regions to be spatially contiguous but does not necessarily make spatial distance or coordinates optimization variables. This provides increased flexibility in terms of region shapes.

The method is called SKATER (Spatial ‘K’luster Analysis by Tree Edge Removal) and is nicely outlined in the respective journal article by Assuncao et al. (2006). The spdep package implements the algorithm into R. The package documentation, though, is rather brief and unintuitive. Fortunately, Luc Anselin provides an illustrative tutorial on how to to apply spdep::skater() to spatial polygons using French regions as an example. My subsequent code is an adaptation of his example to spatial data in raster structure.

As underlying data we use three raster layers, r1, r2 and r3. They are all of equal extent and equal projection (equal-area Mollweide projection). They document the three explanatory variables, say surface temperature, precipitation and elevation, based on which want to assemble k regions.

# 1. Load packages
packs <- list("tidyverse", "raster", "spdep", "parallel")
lapply(packs, require, character.only = T)

# 2. Load raster layers using raster()

# 3. Define the number of regions (k + 1)
k <- 10

# 4. Merge explanatory variables in data frame
dat <- lapply(list(r1, r2, r3), values) %>%
   do.call(cbind, .) %>%
   as.data.frame(., stringsAsFactors = F) %>%
   magrittr::set_colnames(c("Temperature", "Precipitation", "Elevation"))

# 5. Set up parallel framework for faster computation (optional)
# For a non-parallel, single core execution skip steps 5 and 13
ncores <- detectCores() - 1
cl <- parallel::makeCluster(ncores, type = "PSOCK")
set.coresOption(ncores)
set.ClusterOption(cl)

# 6. Standardize variables
sdat <- scale(dat) %>%
   as.data.frame(., stringsAsFactors = F)

# 7. Create neighbor list object
raster_nb <- cell2nb(nrow(r1), ncol(r1), type = "queen")     # you can alternatively set contiguity to "rook"

# 8. Subset cells
# There are various reasons for which you might need to exclude some pixels to avoid errors in subsequent functions
# One example is missing values in your raster layers (e.g. due to water bodies)
complete_pixels <- which(complete.cases(sdat))
raster_nb <- subset.nb(raster_nb, 1:length(raster_nb) %in% complete_pixels)
sdat <- sdat[complete_pixels,]

# 9. Calculate dissimilarity between neighboring cells
lcosts <- nbcosts(raster_nb, sdat)

# 10. Calculate spatial weights based on dissimilarity between neighbors
raster_w <- nb2listw(raster_nb, lcosts, style = "B")

# 11. Obtain minimum spanning tree
raster_mst <- mstree(raster_w)

# 12. Run skater clustering algorithm
skater_clusters <- skater(raster_mst[,1:2], sdat, k)

# 13. Close parallel framework
stopCluster(cl)

Best Answer

Related Solutions

[GIS] Removing Outliers from point layer

Clustering – Identify Spatially Contiguous Clusters in Raster Data Using KMeans

Related Question