[GIS] Clustering spatially connected polygons so all clusters have approximately the same size

grassqgisrsagaspatial-cluster

I want to group together polygons from a shapefile a way that all the clusters formed contained approximately the same number of polygons. The problem is that the source polygons have vastly different sizes. Therefore, I cannot simply make a grid and lay it over my polygons.

The final goal of this work is to do some kind of "tiling" of the polygons to call computer heavy analysis on a sub-sample of my data.

The clustering has to be done strictly from a spatial contiguous polygons point of view. I don't care about any attributes in the shape.

There are some similar questions out there (for example here and here), however my question differs as I don't use attribute clustering, have vastly different size polygons and can't use ArcGIS.

I'm open for a solution is R, QGIS, SAGA or GRASS.

The figure below shows an example of what I mean by vastly different size:

enter image description here

with the red circle being:

enter image description here

My data are openly available at :
http://www12.statcan.gc.ca/census-recensement/2011/geo/bound-limit/files-fichiers/2016/lda_000b16a_e.zip

I was thinking of scripting the solution so I can change the size of the cluster, but my ideal cluster size would be around 1000 polygons per group.

Best Answer

I finally found a way, I scripted my own clustering function in R which works relatively great. The idea is to start from the neighbor list and iteratively make groups. The algorithm looks a little bit like that:

  1. Make one group
    1. select first polygon
    2. find it's neighbors
    3. group them
    4. find the neigbors neighbors
    5. group them
    6. repeat until target size reach
  2. Remove all polygons grouped before from the neighbors list
  3. repeat 1 and 2 until no more polygons are availables
  4. Fuse the small leftover groups to majors groups

I coded this algorithm in R in the form of 4 functions. They are available on my github. It may not be perfect but it works well enough for my need and I manage to get it to work on a fairly large shape (~13000 polygons). It's weakness are potentially:

  • Can produce group of size quite different from target size (e.g. aiming at 1000 finishing qith 1400)
  • the shape of the groups are not optimized (like bubble shaped)

Here is the code for the 4 main functions (however, github version is more recent):

## function to make individual groups
make_a_group <- function(neighb_list, gr_size) {

  # make first group
  gr <- list(gr = c(as.numeric(names(neighb_list)[1]), neighb_list[[1]]),
             eval = as.numeric(names(neighb_list)[1]))
  nn <- length(gr$gr)

  # Until you reach target cluster size, do:
  while(nn < gr_size){

    # find neighbours to evaluate
    to_eval <- gr$gr[!gr$gr %in% gr$eval]
    if(length(to_eval)==0) break() # if none, stop
    i = 1
    nb_to_eval <- length(to_eval)

    # and evaluate them sequentially until you did them al or reach target size
    while(nn < gr_size & i<=nb_to_eval){
      gr$gr <- unique(c(gr$gr, neighb_list[[as.character(to_eval[i])]]))
      i = i+1
      nn <- length(gr$gr)
    }

    # update the info on which polygon was evaluated
    gr$eval <- c(gr$eval, to_eval[1:(i-1)])
  }
  lapply(gr, sort)
}

## function that group the leftover polygons
##  (sometime, a polygon looses all it's neighbors because they were assign other groups)
group_empty_pol <- function(pol_name, neighb_list, the_groups){

  # find the original neighbors of this polygon
  all_ne <-neighb_list[[pol_name]]

  # Find the group which has the most polygons in common with its neighbors
  better_gr <- which.max(sapply(the_groups, function(xx) sum(xx %in% all_ne)))

  # add him to the group
  the_groups[[better_gr]] <- sort(c(as.numeric(pol_name), the_groups[[better_gr]]))
  the_groups
}



## Function to split the whole shape in groups from neighbours list

cluster_neighbours <- function(neighb_list, gr_size){

  # initialize object
  ll_group <- list()
  neighb_temp <- neighb_list

  # until all are neighbors list is empty, do:
  while(length(neighb_temp)>0){

    # Make a group
    res1 <- make_a_group(neighb_list = neighb_temp, gr_size = gr_size)

    # save the group
    ll_group <- append(ll_group, list(res1$gr))

    # Remove from neighbor list all polygons from the just made group
    neighb_temp <- neighb_temp[!names(neighb_temp)%in%as.character(res1$gr)] %>% 
      lapply(function(xx) xx[!xx%in%res1$gr])                                       # remove the used neighbors from neighbor list of left neighbors

    ## managing polygons with no more neighbors
    lost_all_neighb <- neighb_temp[sapply(neighb_temp, length)==0] %>% names

    # if some polygons lost all their neighbors, do:
    if(length(lost_all_neighb)>0){
      for(i in 1:length(lost_all_neighb)){
        #assign the lonely polygon to a group
        ll_group <- group_empty_pol(pol_name = lost_all_neighb[i], neighb_list = neighb_list, the_groups = ll_group)

        # Remove it from the neighbor list
        neighb_temp <- neighb_temp[!names(neighb_temp)%in%lost_all_neighb[i]] %>% 
          lapply(function(xx) xx[!xx%in%as.numeric(lost_all_neighb[i])])       
      }
    }
    ##
  }
  ll_group
}



## final grouping of tiny groups
##  (sometime, some tiny groups are left, we want to fuse them to bigger ones)
fuse_tiny_group <- function(initial_gr, init_sh, hard_min_size){

  # initialize data
  initial_gr <- initial_gr %>% 
    setNames(1:length(.))

  # assign good group to polygon, and dissolve the shape by group
  sh_with_gr <- bind_cols(init_sh,
                          do.call("rbind", mapply(function(gr, index) cbind(index, gr),1:length(initial_gr), initial_gr)) %>% 
                            as.data.frame() %>% 
                            arrange(index)
                          ) %>% 
    select(gr) %>% 
    mutate(gr = as.character(gr)) %>% 
    group_by(gr) %>% 
    summarise(n = n()) %>% 
    arrange(as.numeric(gr)) 

  # Calculate new neighbors
  adj_gr <- sh_with_gr %>% 
    st_touches() %>% 
    setNames(sh_with_gr$gr)

  # find the small groups that are not island (which cannot be group to anything)
  small_gr <- sh_with_gr$gr[sh_with_gr$n<hard_min_size]
  not_island <- names(adj_gr)[sapply(adj_gr, length)!=0]
  small_gr <- intersect(small_gr, not_island)

  # for every small group, fuse it with the smaller number of polygons
  for(igr in small_gr) {
    groups_close_by <- adj_gr[[igr]]

    group_selected <- initial_gr %>% 
      subset(.,names(.)%in%groups_close_by) %>% 
      sapply(length) %>% 
      sort() %>%
      names %>% 
      head(1)

    initial_gr[[group_selected]] <- sort(c(initial_gr[[group_selected]], initial_gr[[igr]]))
    initial_gr[[igr]] <- NULL
  }
  initial_gr
}

And here is a example to use it:

library(sf)
library(dplyr)

DA <- st_read("C:/Users/DXD9163/Desktop/DA_regrouping/DA_CAN_2016_reduced.shp", stringsAsFactors = F) %>% 
  filter(!is.na(DAUID))

DA_qc <- DA[grepl("Quebec",DA$PRNAME),]

DA_qc$area <- st_area(DA_qc)
DA_qc <- DA_qc %>% 
  group_by(DAUID) %>% 
  summarise(area=sum(area))

neighb <- st_touches(DA_qc, sparse = T) %>% 
  setNames(1:nrow(.)) 

gr1 <- cluster_neighbours(neighb_list = neighb, gr_size = 1000)

sapply(gr1, length)

final_group <- fuse_tiny_group(initial_gr = gr1, init_sh = DA_qc, hard_min_size = 400)

bind_cols(DA_qc,
          do.call("rbind", mapply(function(gr, index) cbind(index, gr),1:length(final_group), final_group)) %>% 
            as.data.frame() %>% 
            arrange(index)
          ) %>% 
  select(gr) %>% 
  mutate(gr = as.character(gr)) %>% 
  st_write("C:/Users/DXD9163/Desktop/DA_regrouping/sh_separe/result.shp")

Which gives:

enter image description here

and zoomed:

enter image description here

Related Question