Strategies for Working with Large Shapefiles in R

r, shapefile

I am interested in learning about how to work with "Road Network" files – for example, I would like to find out the driving distance between two sets of geographical coordinates (i.e. longitude and latitude).

I found this shapefile for the Canadian Road Network: https://www12.statcan.gc.ca/census-recensement/2011/geo/RNF-FRR/files-fichiers/lrnf000r22a_e.zip – now I am trying to import this file into R.

Below is the code I am using to first download the shapefile to a temporary folder and then read it:

library(sf)
library(rgdal)
# Set the URL for the shapefile
url <- "https://www12.statcan.gc.ca/census-recensement/2011/geo/RNF-FRR/files-fichiers/lrnf000r22a_e.zip"

# Create a temporary folder to download and extract the shapefile
temp_dir <- tempdir()
temp_file <- file.path(temp_dir, "lrnf000r22a_e.zip")

# Download the shapefile to the temporary folder (binary mode, since it is a zip)
download.file(url, temp_file, mode = "wb")

# Extract the shapefile from the downloaded zip file
unzip(temp_file, exdir = temp_dir)

# Read the shapefile using the rgdal package (already loaded above)
shapefile <- readOGR(dsn = temp_dir, layer = "lrnf000r22a_e")

But when trying to run the last line of code (readOGR), I get the following error:

OGR data source with driver: ESRI Shapefile 
Source: "C:\Users\me\AppData\Local\Temp\RtmpwDKofs", layer: "lrnf000r22a_e"
with 2246324 features
It has 21 fields
Integer64 fields read as strings:  OBJECTID 
Error: memory exhausted (limit reached?)
In addition: Warning messages:
1: OGR support is provided by the sf and terra packages among others 
2: OGR support is provided by the sf and terra packages among others 
3: OGR support is provided by the sf and terra packages among others 
4: OGR support is provided by the sf and terra packages among others 
5: OGR support is provided by the sf and terra packages among others 
6: OGR support is provided by the sf and terra packages among others 

This seems to be a very large shapefile, and my computer does not have enough memory to work with it.

Has anyone ever encountered this problem before? Are there any strategies I can use to resolve this error?

Note: Perhaps it's possible to read this in "chunks"? (e.g. read N rows from a shapefile, as GeoPandas allows)

Best Answer

Using the sf package, you can apply a SQL SELECT query to a shapefile via st_read's query argument. For example, with a huge road network you could restrict the read to main roads only, or to roads within a smaller bounding box that you are interested in.

For example, to select only the highways:

hwy <- st_read("./lrnf000r22a_e.shp",
               query = "select * from lrnf000r22a_e where TYPE='HWY'")

That gets you 115,619 features, down from the 2.2 million in the full data set.

You can do a lot more filtering with st_read, including using query= to read the data in "chunks" by specifying ranges of ID numbers (I think this data set uses OBJECTID). You can also pass a wkt_filter parameter to read only the data within some polygonal bounds; see the examples in the help for st_read for full details.
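For instance, a chunked read and a spatial filter might look like the sketch below. The OBJECTID range and the bounding-box coordinates are illustrative assumptions, not values taken from this data set; check the layer's attribute table and CRS before using them:

```r
library(sf)

# "Chunked" read via OGR SQL on the (assumed) OBJECTID field --
# adjust the field name and range to match the actual attribute table.
chunk1 <- st_read(
  "./lrnf000r22a_e.shp",
  query = "SELECT * FROM lrnf000r22a_e WHERE OBJECTID BETWEEN 1 AND 100000"
)

# Spatial filter: read only features intersecting a bounding box,
# expressed as WKT in the layer's CRS (these coordinates are made up).
bbox_wkt <- paste0(
  "POLYGON((7000000 900000, 7100000 900000, ",
  "7100000 1000000, 7000000 1000000, 7000000 900000))"
)
subset_roads <- st_read("./lrnf000r22a_e.shp", wkt_filter = bbox_wkt)
```

Both filters are pushed down to GDAL, so only the matching features are ever loaded into R, which keeps memory use proportional to the chunk rather than the full 2.2-million-feature layer.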
