Clustering – Identify Spatially Contiguous Clusters in Raster Data Using KMeans

clusteringmachine learningrraster

I would like to cluster the cells of a raster object into k contiguous regions using kmeans. The number of regions, k, is known. Each cell has various geographical attributes, such as temperature, precipitation, elevation etc. And I need R to sort pixels into k groups (regions) using these cell values. With stats::kmeans() that is a fairly simple exercise. Unfortunately, this method does not create spatially contiguous clusters. Instead, each group consists of pixels spread all over the grid.

I expected this to be a common problem, but could not find any R function that solves it. There should be some package out there that can perform clustering under the restriction of spatial contiguity.

The method of choice here is kmeans because I know how many groups I need. Any other technique from the field of unsupervised learning that allows me to a priori set this quantity, is of course also welcome.

I asked this question on Stack Overflow before and received a recommendation to also post it here, as GIS might be a better fit than SO.

Best Answer

Based on the discussion above, I further explored the respective literature and found a very suitable algorithm addressing my question. The method is capable of defining neighborhood through rows and columns in the grid rather than distances between pixel centroids. I can directly choose between Queen's and Rook's case contiguity. Distances do not have to be included as explanatory variables and then weighted until regions are contiguous. The algorithm restricts regions to be spatially contiguous but does not necessarily make spatial distance or coordinates optimization variables. This provides increased flexibility in terms of region shapes.

The method is called SKATER (Spatial ‘K’luster Analysis by Tree Edge Removal) and is nicely outlined in the respective journal article by Assuncao et al. (2006). The spdep package implements the algorithm into R. The package documentation, though, is rather brief and unintuitive. Fortunately, Luc Anselin provides an illustrative tutorial on how to to apply spdep::skater() to spatial polygons using French regions as an example. My subsequent code is an adaptation of his example to spatial data in raster structure.

As underlying data we use three raster layers, r1, r2 and r3. They are all of equal extent and equal projection (equal-area Mollweide projection). They document the three explanatory variables, say surface temperature, precipitation and elevation, based on which want to assemble k regions.

# 1. Load packages
packs <- list("tidyverse", "raster", "spdep", "parallel")
lapply(packs, require, character.only = T)

# 2. Load raster layers using raster()

# 3. Define the number of regions (k + 1)
k <- 10

# 4. Merge explanatory variables in data frame
dat <- lapply(list(r1, r2, r3), values) %>%
   do.call(cbind, .) %>%
   as.data.frame(., stringsAsFactors = F) %>%
   magrittr::set_colnames(c("Temperature", "Precipitation", "Elevation"))

# 5. Set up parallel framework for faster computation (optional)
# For a non-parallel, single core execution skip steps 5 and 13
ncores <- detectCores() - 1
cl <- parallel::makeCluster(ncores, type = "PSOCK")
set.coresOption(ncores)
set.ClusterOption(cl)

# 6. Standardize variables
sdat <- scale(dat) %>%
   as.data.frame(., stringsAsFactors = F)

# 7. Create neighbor list object
raster_nb <- cell2nb(nrow(r1), ncol(r1), type = "queen")     # you can alternatively set contiguity to "rook"

# 8. Subset cells
# There are various reasons for which you might need to exclude some pixels to avoid errors in subsequent functions
# One example is missing values in your raster layers (e.g. due to water bodies)
complete_pixels <- which(complete.cases(sdat))
raster_nb <- subset.nb(raster_nb, 1:length(raster_nb) %in% complete_pixels)
sdat <- sdat[complete_pixels,]

# 9. Calculate dissimilarity between neighboring cells
lcosts <- nbcosts(raster_nb, sdat)

# 10. Calculate spatial weights based on dissimilarity between neighbors
raster_w <- nb2listw(raster_nb, lcosts, style = "B")

# 11. Obtain minimum spanning tree
raster_mst <- mstree(raster_w)

# 12. Run skater clustering algorithm
skater_clusters <- skater(raster_mst[,1:2], sdat, k)

# 13. Close parallel framework
stopCluster(cl)