R Raster Processing – Increasing Speed of Crop, Mask, & Extract by Polygons

Tags: clip, masking, performance, r, raster

I'm extracting the area and percent cover of different land use types from a raster based on several thousand polygon boundaries. I've found that the extract function works much faster if I iterate through each individual polygon and crop then mask the raster down to the size of the particular polygon. Nonetheless, it's pretty slow, and I'm wondering if anyone has any suggestions for improving the efficiency and speed of my code.

The only thing I've found related to this is this response by Roger Bivand, who suggested using GDAL.open() and GDAL.close() as well as getRasterTable() and getRasterData(). I looked into those, but I've had trouble with gdal in the past and don't know it well enough to implement them.
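For reference, a minimal sketch of what that rgdal route might look like, reading a block of values straight from the file (the filename is a placeholder and I haven't benchmarked this):

library(rgdal)

ds <- GDAL.open("myraster.tif") #placeholder filename
vals <- getRasterData(ds, offset = c(0, 0), region.dim = c(100, 100)) #read a 100x100 block of cell values
GDAL.close(ds)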

Reproducible Example:

library(maptools)  ## For wrld_simpl
library(raster)

## Example SpatialPolygonsDataFrame
data(wrld_simpl) #polygon of world countries
bound <- wrld_simpl[1:25,] #subset to 25 countries; the loop below uses this variable name

## Example RasterLayer
c <- raster(nrow=2e3, ncol=2e3, crs=proj4string(wrld_simpl), xmn=-180, xmx=180, ymn=-90, ymx=90)
c[] <- 1:length(c)

#plot, so you can see it
plot(c)    
plot(bound, add=TRUE) 

Fastest Method so far

result <- data.frame() #empty result dataframe

system.time(
  for (i in 1:nrow(bound)) { #iterate through the polygons
    single <- bound[i,] #select a single polygon
    clip1 <- crop(c, extent(single)) #crop the raster to the extent of the polygon first, because it speeds up the mask
    clip2 <- mask(clip1, single) #mask the cropped raster to the polygon boundary

    ext <- extract(clip2, single) #extract data from the raster based on the polygon boundary
    tab <- lapply(ext, table) #make a table of the extract output
    s <- sum(tab[[1]]) #sum the table for the percentage calculation
    mat <- as.data.frame(tab)
    mat2 <- as.data.frame(tab[[1]]/s) #calculate percent
    final <- cbind(single@data$NAME, mat, mat2$Freq) #combine into a single dataframe
    result <- rbind(final, result)
  })

   user  system elapsed 
  39.39    0.11   39.52 

Parallel Processing

Parallel processing cut the user time in half, but negated the benefit by doubling the system time. raster uses the cluster for the extract function, but unfortunately not for crop or mask. This leaves a slightly larger total elapsed time, because the workers spend part of it waiting on I/O.

beginCluster(detectCores() - 1) #use all but one core

Run the same loop as above, now on multiple cores:

  user  system elapsed 
  23.31    0.68   42.01 

Then end the cluster:

endCluster()

Slow Method:
The alternative of calling extract directly on the full raster takes far longer, and I'm not sure about the data management to get the output into the form I want:

system.time(ext<-extract(c,bound))
   user  system elapsed 
1170.64   14.41 1186.14 
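For what it's worth, the data management on that output follows the same pattern as the loop above; a minimal sketch of turning the list into per-polygon percent cover:

tabs <- lapply(ext, table)                   #frequency table of cell values per polygon
pct  <- lapply(tabs, function(t) t / sum(t)) #convert counts to proportions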

Best Answer

I have finally gotten around to improving this function. I found that for my purposes, it was fastest to rasterize() the polygon first and use getValues() instead of extract(). The rasterizing isn't much faster than the original code for tabulating raster values in small polygons, but it shines for large polygon areas with large rasters to be cropped and extracted. I also found getValues() to be much faster than the extract() function.
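On the reproducible example from the question, the core of the trick looks like this for a single polygon:

single <- bound[1,] #first polygon from the question's example
clip1 <- crop(c, extent(single)) #crop to the polygon's extent
clip2 <- rasterize(single, clip1, mask = TRUE) #rasterize the polygon as a mask
table(getValues(clip2)) #tabulate the masked cell values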

I also figured out the multi-core processing using foreach().

I hope this is useful for other people who want an R solution for extracting raster values from a polygon overlay. It is similar to ArcGIS's "Tabulate Intersection" tool, which did not work well for me, returning empty outputs after hours of processing, as this user also found.

#initiate multicore cluster and load packages
library(foreach)
library(doParallel)
library(tcltk)
library(sp)
library(raster)

cores <- 7
cl <- makeCluster(cores, outfile = "") #outfile="" makes the workers print errors to the console
registerDoParallel(cl)

Here's the function:

multicore.tabulate.intersect <- function(cores, polygonlist, rasterlayer){
  foreach(i = 1:cores, .packages = c("raster","tcltk","foreach"), .combine = rbind) %dopar% {

    mypb <- tkProgressBar(title = "R progress bar", label = "", min = 0, max = length(polygonlist[[i]]), initial = 0, width = 300)

    foreach(j = 1:length(polygonlist[[i]]), .combine = rbind) %do% {
      final <- data.frame()
      tryCatch({ #not sure this is necessary now that I'm using foreach, but it is useful for loops

        single <- polygonlist[[i]][j,] #pull out the individual polygon to be tabulated

        dir.create(file.path("c:/rtemp", i, j, single@data$OWNER), recursive = TRUE, showWarnings = FALSE) #create a unique temp directory (recursive = TRUE so the nested path is actually built)
        rasterOptions(tmpdir = file.path("c:/rtemp", i, j, single@data$OWNER)) #set the temp directory - this is important b/c raster temp files can fill up a hard drive if you're doing a lot of polygons

        clip1 <- crop(rasterlayer, extent(single)) #crop to the extent of the polygon
        clip2 <- rasterize(single, clip1, mask = TRUE) #crop to the polygon edge & convert to raster
        ext <- getValues(clip2) #much faster than extract
        tab <- table(ext) #tabulate the values of the raster within the polygon

        mat <- as.data.frame(tab)
        final <- cbind(single@data$OWNER, mat) #combine with the name of the polygon
        unlink(file.path("c:/rtemp", i, j, single@data$OWNER), recursive = TRUE, force = TRUE) #delete temporary files
        setTkProgressBar(mypb, j, title = "number complete", label = j)

      }, error = function(e){cat("ERROR :", conditionMessage(e), "\n")}) #tryCatch the error so it doesn't kill the loop

      final #the last expression is what foreach combines; return() can misbehave inside a foreach body
    }
    #close(mypb) #closing the progress bar while operating caused it to return an empty final dataset, so it is left open
  }
}

To use it, change single@data$OWNER to the column name that identifies your polygons (granted, that could have been built into the function as an argument) and call:

myoutput <- multicore.tabulate.intersect(cores, polygonlist, rasterlayer)
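Note that the function indexes polygonlist[[i]] per core, so it expects a list with one chunk of polygons per core. A minimal sketch of building such a list from a single SpatialPolygonsDataFrame (here bound from the question; the round-robin split is my own choice, any even split works):

polygonlist <- lapply(1:cores, function(i) bound[seq(i, nrow(bound), by = cores), ]) #one chunk of polygons per core
myoutput <- multicore.tabulate.intersect(cores, polygonlist, c)
stopCluster(cl) #shut down the workers when finished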