[GIS] Really slow extraction from raster even after using crop

Tags: clip, extract-by-mask, parallel-processing, r, raster

I have a large raster file (245,295,396 cells) and stacks of rasters, 4 layers each, which lie within the extent of this large raster. To start with, I am trying to get values from one stack (3 channels) and, for the same zone, from the large raster. Everything works fine, except that the extraction from the large raster takes 5 minutes. So if I repeat this process for the 4000 remaining stacks, it will take about 13 days.

cld <- raster("cdl_30m_r_il_2014_albers.tif")  # this is the large raster
r   <- stack(paste(path, "/data_robin/", fl, sep = ""))  # 1 stack; I have 4000 similar
mat <- as.data.frame(getValues(r))             # getting values from the stack
xy  <- xyFromCell(r, c(1:ncell(r)), spatial = TRUE)
clip1 <- crop(cld, extent(r))                  # crop the large raster to a smaller size
cells <- cellFromXY(clip1, xy)
mat$landuse <- NA
# mat$landuse <- cld[cells]
mat$landuse <- extract(clip1, cells)           # this line takes 5 mins based on profiling

> cld
class       : RasterLayer 
dimensions  : 20862, 11758, 245295396  (nrow, ncol, ncell)
resolution  : 30, 30  (x, y)
extent      : 378585, 731325, 1569045, 2194905  (xmin, xmax, ymin, ymax)
coord. ref. : +proj=aea +lat_1=29.5 +lat_2=45.5 +lat_0=23 +lon_0=-96 +x_0=0 +y_0=0 +datum=NAD83 +units=m +no_defs +ellps=GRS80 +towgs84=0,0,0 
data source : /Users/kaswani/R/Image/cdl_30m_r_il_2014_albers.tif 
names       : cdl_30m_r_il_2014_albers 
values      : 0, 255  (min, max) 

> r
class       : RasterStack 
dimensions  : 9230, 7502, 69243460, 4  (nrow, ncol, ncell, nlayers)
resolution  : 0.7995722, 0.7995722  (x, y)
extent      : 589084.4, 595082.8, 1564504, 1571884  (xmin, xmax, ymin, ymax)
coord. ref. : +proj=aea +lat_1=29.5 +lat_2=45.5 +lat_0=23 +lon_0=-96 +x_0=0 +y_0=0 +ellps=GRS80 +towgs84=0,0,0,0,0,0,0 +units=m +no_defs 
names       : m_3608906_ne_16.1, m_3608906_ne_16.2, m_3608906_ne_16.3, m_3608906_ne_16.4 
min values  :                 0,                 0,                 0,                 0 
max values  :               255,               255,               255,               255 

My data is in .tiff format and I am new to geospatial coding.

I have also tried the approach in *Increasing speed of crop, mask, & extract raster by many polygons in R?*, but during the masking step it fails with `Error in compareRaster(x, mask) : different extent`.

Best Answer

1) `resample` gives a ~50% improvement

I was able to get about a 50% improvement by resampling directly from the `cld` raster to a new raster with the same extent/resolution as `r`, using nearest-neighbour sampling:

system.time({
  mat<-as.data.frame(getValues(r))
  mat$landuse<- NA
  mat$landuse<-getValues(resample(cld,r,method='ngb'))
})
   user  system elapsed 
   2.60    0.00    2.61

vs.

system.time({
  mat<-as.data.frame(getValues(r)) # getting values from the stack
  xy<-xyFromCell(r,c(1:ncell(r)),spatial = TRUE)
  cells<-cellFromXY(clip1,xy)
  mat$landuse<- NA
  mat$landuse<- extract(clip1,cells) #this line takes 5 mins based on profiling
})
   user  system elapsed 
   4.98    0.00    5.02 

(These timings are from a smaller test dataset, but the `resample` approach also has a much smaller memory footprint.)
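If memory is the bottleneck at full scale, `resample` can also write its result straight to disk rather than holding it in RAM: `raster` functions accept a `filename` argument that is passed through to `writeRaster`. A minimal sketch, assuming the land-use codes fit in an unsigned byte (the 0–255 value range above suggests they do); the output filename is just an example:

```r
library(raster)

# Resample the big land-use raster onto r's grid, writing directly to disk
# so the full result never has to sit in memory at once.
landuse <- resample(cld, r, method = 'ngb',
                    filename = "landuse_resampled.tif",
                    datatype = "INT1U",   # 0-255 fits in one unsigned byte
                    overwrite = TRUE)

mat$landuse <- getValues(landuse)
```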

2) Parallelization could get you a lot more

That alone improves things significantly, but you can get a much bigger improvement if you can parallelize the work. R has several parallelization backends, all of which can be driven through the `foreach` package. I assume you are going to either process `mat` in place or save it for later. Since it takes so much work to produce the resampled data, let's assume we save it for later; the most convenient form is probably a raster written alongside the `data_robin` files.

Unfortunately, the parallelization options differ between Windows and Unix: on Linux, use `doMC`; on Windows, use `doSNOW`. Assuming we employ 4 workers:

Linux initialization:

library(doMC)
registerDoMC(4) # number of workers should be less than number of CPU cores

Windows initialization:

library(doSNOW)
cluster<-makeCluster(4, type = "SOCK") # num workers should be < num CPU cores
registerDoSNOW(cluster)

next:

library(foreach)
library(tools)

# Assume you have a character vector of filenames called 'files'
foreach (i = seq_along(files), .packages = c('raster', 'tools')) %dopar% {
    r <- stack(paste0(path, "/data_robin/", files[i]))
    outFilename <- paste0(path, "/data_robin/", file_path_sans_ext(files[i]), "_cld.tif")
    cldResampled <- resample(cld, r, method = 'ngb')
    writeRaster(cldResampled, filename = outFilename, format = "GTiff")
}

One drawback of the parallel `foreach` is that it's hard to tell when something goes wrong. It's a good idea to run the loop serially first, by replacing `%dopar%` with `%do%`, until you know it works, and only then let it run through the whole set.
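Once you do switch to `%dopar%`, `foreach` can be told to keep going and hand back the errors rather than aborting the whole run. A sketch using the `.errorhandling = 'pass'` option (the `files`, `path`, and `cld` names are carried over from the loop above):

```r
library(foreach)

# 'pass' puts the error object in the result list instead of stopping,
# so one bad file doesn't kill the run and you can see which task failed.
results <- foreach(i = seq_along(files),
                   .packages = c('raster', 'tools'),
                   .errorhandling = 'pass') %dopar% {
    r <- stack(paste0(path, "/data_robin/", files[i]))
    resample(cld, r, method = 'ngb',
             filename = paste0(path, "/data_robin/",
                               file_path_sans_ext(files[i]), "_cld.tif"),
             format = "GTiff", overwrite = TRUE)
}

# Which tasks errored, and on which files?
bad <- which(sapply(results, inherits, "error"))
files[bad]
```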

**Caveats:** In my simple example above (each raster had 1/100 of the pixels of your `cld` and `r`, respectively), engaging 5 workers only gained an additional 30% over running serially in a single process. I was also unable to parallelize the example with full-size rasters without getting `Error in { : task 1 failed - "cannot allocate vector of size xx.x Mb"`. Conceptually this should work, but I wasn't able to get it running at the scale you are working at.
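One knob worth trying before giving up on the full-size rasters is `raster`'s own memory settings: `rasterOptions()` lets you cap how much each process tries to hold in RAM, which matters when several workers share one machine. A sketch; the specific values are guesses you'd tune for your hardware, and the units depend on your `raster` version (see `?rasterOptions`):

```r
library(raster)

# Cap per-process memory use so 4 workers don't collectively exhaust RAM.
# maxmemory: how much data raster keeps in memory for one operation;
# chunksize: the block size used when processing from disk instead.
rasterOptions(maxmemory = 1e8, chunksize = 1e7)
```

With lower limits, functions like `resample` fall back to block-wise processing from disk, trading speed for a bounded footprint.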
