R – Fast Algorithm to Calculate Difference Between Two Sets of Polygons Comparable to QGIS

differenceseraseqgisrst-difference

I have two very large polygon shapefiles for which I need to calculate the difference between the two shapefiles. One is a shapefile of forest polygons, the other is a shapefile of road polygons. I want to remove the area of the forest polygons that intersects with road polygons, while maintaining each polygon as a separate feature and its data.

After much trial and error, I found that erase using the sp format or using st_erase using the sf format performs the operation that I need.

st_erase = function(x, y) st_difference(x, st_union(st_combine(y)))

However, the processing time is really long. With a small sample of my dataset, I have processing times several minutes long. (And R crashes with a tenth of the dataset). Meanwhile, the function Vector–>Geoprocessing tools–> Difference in QGIS takes 83 seconds with a tenth of the dataset.

erase(forest_sp, roads_sp)
#2.83488 minutes 
forest_sp-roads_sp
#2.8698 minutes
st_erase(forest, roads) 
#58.77sec

Is there a function in R performs this operation quickly? Or is R inferior to QGIS in processing times for this kind of spatial function?

Background

There are multiple questions on gis.stackexchange that deal with this type of problem. Some are with the sp format and suggest gDifference or erase (or its analog polygon1-polygon2):

Erase, gDifference

Clip polygon and retain data?

gDifference

Remove parts of line that fall within a polygon

Identify area within one shapefile not in another shapefile in R

Reverse clipping (erasing ) in R?

Other questions suggest solutions with the sf format:
st_difference

Reverse clipping (erasing ) in R?

Using simple features (sf) in R, how do I erase polygons overlapping with another layer

ogr2ogr equivalent of QGIS Union

All state that they perform that same thing (remove intersecting portions) and the official tag definition on gis.stackexchange states that a difference operation removes sections. However, I have found that often they don't.
For example, gDifference with byID=FALSE that merges all into one MULTIPOLYGON does remove overlapping sections but gDifference with byID=TRUE that keeps polygons separate includes overlapping sections. With sf, st_difference also keeping overlapping sections and ends up with 10x as many polygons as my original file.

enter image description here

However, erase and the helper function st_erase do work as expected.

enter image description here

Difference

A "difference" operation is a geoprocessing operation that completely
removes one overlapping polygon and the intersection of that polygon
from the other. It uses one polygon to "take a bite" out of another.

Erase

The action of removing data attributes or features through the
processes of positive or negative selection and deletion.

Best Answer

The rmapshaper package offers a faster function for "erasing" or finding the difference between two polygon layers in R. Below is a comparison of the st_erase approach with sf and the ms_erase function in rmapshaper.

# Let's try an example using publicly available Census vector layers through `tigris`
if(!require(dplyr)){install.packages("dplyr")}
library(dplyr)
if(!require(sf)){install.packages("sf")}
library(sf)
if(!require(tigris)){install.packages("tigris")}
library(tigris)
if(!require(rmapshaper)){install.packages("rmapshaper")}
library(rmapshaper)

# First download two separate layers that overlap. We'll use a layer of Census 
# tracts and a layer of areal water features for Worcester County in Massachusetts.
# worcester_tracts has 172 obs and  10 variables; about 285KB in size.
worcester_tracts <- tracts(state = "MA", county = "Worcester", cb = TRUE) %>% 
  st_transform(., crs = 2805) %>% # use projected layers when doing geometric stuff
  st_make_valid()
# worcester_water has 3,270 obs and 9 variables; about 4.9MB in size. 
worcester_water <- area_water(state = "MA", county = "Worcester") %>% 
  st_transform(., crs = 2805) %>% 
  st_make_valid()

# Start with st_erase function. Note that I altered function slightly by removing 
# the st_combine function because it throws an error and isn't necessary. 
# Takes about 27 secs.
st_erase = function(x, y) st_difference(x, st_union(y))
system.time(erased_tracts1 <- st_erase(worcester_tracts, worcester_water))

# Now try it with ms_erase. Takes about 4.3 secs; over 6x faster. 
system.time(erased_tracts2 <- ms_erase(worcester_tracts, worcester_water))

The results are identical. You can speed things up even more by retaining as few variables, or attributes, as possible in the layers when doing these operations. Also, if your computer memory is being overwhelmed, consider breaking up the process into smaller geographic chunks (e.g., municipalities, counties, a grid), rather than working on the whole dataset at one time.