[GIS] Really slow extraction from raster even after using crop

Tags: clip, extract-by-mask, parallel-processing, r, raster

I have a large raster file (245,295,396 cells) and stacks of rasters, 4 layers each, which lie within the extent of this large raster. To start with, I am trying to get values from one stack (3 channels) and, for the same zone, from the large raster. Everything works fine, except that the extraction from the large raster takes 5 minutes. So if I repeat this process for the 4000 remaining stacks, it will take about 13 days.

cld <- raster("cdl_30m_r_il_2014_albers.tif")  # this is the large raster
r   <- stack(paste(path, "/data_robin/", fl, sep = ""))  # 1 stack; I have 4000 similar
mat <- as.data.frame(getValues(r))             # getting values from the stack
xy  <- xyFromCell(r, c(1:ncell(r)), spatial = TRUE)
clip1 <- crop(cld, extent(r))                  # crop the large raster to a smaller size
cells <- cellFromXY(clip1, xy)
mat$landuse <- NA
# mat$landuse <- cld[cells]
mat$landuse <- extract(clip1, cells)           # this line takes 5 mins based on profiling

> cld
class       : RasterLayer 
dimensions  : 20862, 11758, 245295396  (nrow, ncol, ncell)
resolution  : 30, 30  (x, y)
extent      : 378585, 731325, 1569045, 2194905  (xmin, xmax, ymin, ymax)
coord. ref. : +proj=aea +lat_1=29.5 +lat_2=45.5 +lat_0=23 +lon_0=-96 +x_0=0 +y_0=0 +datum=NAD83 +units=m +no_defs +ellps=GRS80 +towgs84=0,0,0 
data source : /Users/kaswani/R/Image/cdl_30m_r_il_2014_albers.tif 
names       : cdl_30m_r_il_2014_albers 
values      : 0, 255  (min, max) 

> r
class       : RasterStack 
dimensions  : 9230, 7502, 69243460, 4  (nrow, ncol, ncell, nlayers)
resolution  : 0.7995722, 0.7995722  (x, y)
extent      : 589084.4, 595082.8, 1564504, 1571884  (xmin, xmax, ymin, ymax)
coord. ref. : +proj=aea +lat_1=29.5 +lat_2=45.5 +lat_0=23 +lon_0=-96 +x_0=0 +y_0=0 +ellps=GRS80 +towgs84=0,0,0,0,0,0,0 +units=m +no_defs 
names       : m_3608906_ne_16.1, m_3608906_ne_16.2, m_3608906_ne_16.3, m_3608906_ne_16.4 
min values  :                 0,                 0,                 0,                 0 
max values  :               255,               255,               255,               255 

My data is in .tiff format and I am new to geospatial coding.

I have also tried the approach in *Increasing speed of crop, mask, & extract raster by many polygons in R?*, but during the masking step it fails with `Error in compareRaster(x, mask) : different extent`.

Best Answer

1) `resample` gives a ~50% improvement

I was able to get about a 50% improvement by resampling directly from the `cld` raster to a new raster with the same extent/resolution as `r`, using nearest-neighbour sampling:

system.time({
  mat<-as.data.frame(getValues(r))
  mat$landuse<- NA
  mat$landuse<-getValues(resample(cld,r,method='ngb'))
})
   user  system elapsed 
   2.60    0.00    2.61

vs.

system.time({
  mat<-as.data.frame(getValues(r)) # getting values from the stack
  xy<-xyFromCell(r,c(1:ncell(r)),spatial = TRUE)
  cells<-cellFromXY(clip1,xy)
  mat$landuse<- NA
  mat$landuse<- extract(clip1,cells) #this line takes 5 mins based on profiling
})
   user  system elapsed 
   4.98    0.00    5.02 

(These timings are from a smaller test dataset, but the `resample` approach also has a much smaller memory footprint.)
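If memory is the bottleneck at full scale, `resample` can also write its result straight to disk rather than holding it in RAM: `raster` functions accept a `filename` argument that is passed through to `writeRaster`. A minimal sketch, assuming the land-use codes fit in an unsigned byte (the 0–255 value range above suggests they do); the output filename is just an example:

```r
library(raster)

# Resample the big land-use raster onto r's grid, writing directly to disk
# so the full result never has to sit in memory at once.
landuse <- resample(cld, r, method = 'ngb',
                    filename = "landuse_resampled.tif",
                    datatype = "INT1U",   # 0-255 fits in one unsigned byte
                    overwrite = TRUE)

mat$landuse <- getValues(landuse)
```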

2) Parallelization could get you a lot more

That alone improves things significantly, but you can get a much bigger improvement if you can parallelize the work. R has several parallelization backends, all of which can be driven through the `foreach` package. I assume you are going to either process `mat` in place or save it for later. Since it takes so much work to produce the resampled data, let's assume we save it for later; the most convenient form is probably a raster written alongside the `data_robin` files.

Unfortunately, the parallelization options differ between Windows and Unix: on Linux, use `doMC`; on Windows, use `doSNOW`. Assuming we employ 4 workers:

Linux initialization:

library(doMC)
registerDoMC(4) # number of workers should be less than number of CPU cores

Windows initialization:

library(doSNOW)
cluster<-makeCluster(4, type = "SOCK") # num workers should be < num CPU cores
registerDoSNOW(cluster)

next:

library(foreach)
library(tools)

# Assume you have a character vector of filenames called 'files'
foreach (i = seq_along(files), .packages = c('raster', 'tools')) %dopar% {
    r <- stack(paste0(path, "/data_robin/", files[i]))
    outFilename <- paste0(path, "/data_robin/", file_path_sans_ext(files[i]), "_cld.tif")
    cldResampled <- resample(cld, r, method = 'ngb')
    writeRaster(cldResampled, filename = outFilename, format = "GTiff")
}

One drawback of the parallel `foreach` is that it's hard to tell when something goes wrong. It's a good idea to run the loop serially first, by replacing `%dopar%` with `%do%`, until you know it works, and only then let it run through the whole set.
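Once you do switch to `%dopar%`, `foreach` can be told to keep going and hand back the errors rather than aborting the whole run. A sketch using the `.errorhandling = 'pass'` option (the `files`, `path`, and `cld` names are carried over from the loop above):

```r
library(foreach)

# 'pass' puts the error object in the result list instead of stopping,
# so one bad file doesn't kill the run and you can see which task failed.
results <- foreach(i = seq_along(files),
                   .packages = c('raster', 'tools'),
                   .errorhandling = 'pass') %dopar% {
    r <- stack(paste0(path, "/data_robin/", files[i]))
    resample(cld, r, method = 'ngb',
             filename = paste0(path, "/data_robin/",
                               file_path_sans_ext(files[i]), "_cld.tif"),
             format = "GTiff", overwrite = TRUE)
}

# Which tasks errored, and on which files?
bad <- which(sapply(results, inherits, "error"))
files[bad]
```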

**Caveats:** In my simple example above (each raster had 1/100 of the pixels of your `cld` and `r`, respectively), engaging 5 workers only gained an additional 30% over running serially in a single process. I was also unable to parallelize the example with full-size rasters without getting `Error in { : task 1 failed - "cannot allocate vector of size xx.x Mb"`. Conceptually this should work, but I wasn't able to get it running at the scale you are working at.
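One knob worth trying before giving up on the full-size rasters is `raster`'s own memory settings: `rasterOptions()` lets you cap how much each process tries to hold in RAM, which matters when several workers share one machine. A sketch; the specific values are guesses you'd tune for your hardware, and the units depend on your `raster` version (see `?rasterOptions`):

```r
library(raster)

# Cap per-process memory use so 4 workers don't collectively exhaust RAM.
# maxmemory: how much data raster keeps in memory for one operation;
# chunksize: the block size used when processing from disk instead.
rasterOptions(maxmemory = 1e8, chunksize = 1e7)
```

With lower limits, functions like `resample` fall back to block-wise processing from disk, trading speed for a bounded footprint.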
