I have a large raster file (245,295,396 cells) and stacks of rasters, 4 layers each, that lie within the extent of this large raster. To start, I am trying to get values from one stack (3 channels) and, for the same zone, from the large raster. Everything works fine, except that the extraction from the large raster takes 5 minutes. If I repeat this process 4000 times, it will take about 13 days.
library(raster)

cld <- raster("cdl_30m_r_il_2014_albers.tif")        # this is the large raster
r <- stack(paste(path, "/data_robin/", fl, sep=""))  # 1 stack; I have 4000 similar
mat <- as.data.frame(getValues(r))                   # getting values from the stack
xy <- xyFromCell(r, c(1:ncell(r)), spatial = TRUE)
clip1 <- crop(cld, extent(r))                        # tried to crop it to a smaller size
cells <- cellFromXY(clip1, xy)
mat$landuse <- NA
# mat$landuse <- cld[cells]
mat$landuse <- extract(clip1, cells)                 # this line takes 5 mins based on profiling
> cld
class : RasterLayer
dimensions : 20862, 11758, 245295396 (nrow, ncol, ncell)
resolution : 30, 30 (x, y)
extent : 378585, 731325, 1569045, 2194905 (xmin, xmax, ymin, ymax)
coord. ref. : +proj=aea +lat_1=29.5 +lat_2=45.5 +lat_0=23 +lon_0=-96 +x_0=0 +y_0=0 +datum=NAD83 +units=m +no_defs +ellps=GRS80 +towgs84=0,0,0
data source : /Users/kaswani/R/Image/cdl_30m_r_il_2014_albers.tif
names : cdl_30m_r_il_2014_albers
values : 0, 255 (min, max)
> r
class : RasterStack
dimensions : 9230, 7502, 69243460, 4 (nrow, ncol, ncell, nlayers)
resolution : 0.7995722, 0.7995722 (x, y)
extent : 589084.4, 595082.8, 1564504, 1571884 (xmin, xmax, ymin, ymax)
coord. ref. : +proj=aea +lat_1=29.5 +lat_2=45.5 +lat_0=23 +lon_0=-96 +x_0=0 +y_0=0 +ellps=GRS80 +towgs84=0,0,0,0,0,0,0 +units=m +no_defs
names : m_3608906_ne_16.1, m_3608906_ne_16.2, m_3608906_ne_16.3, m_3608906_ne_16.4
min values : 0, 0, 0, 0
max values : 255, 255, 255, 255
My data is in .tiff format and I am new to geospatial coding.
I have also tried the approach in "Increasing speed of crop, mask, & extract raster by many polygons in R?", but during the masking step it fails with: Error in compareRaster(x, mask) : different extent.
Best Answer
1) resample results in a ~50% improvement

I was able to get about a 50% improvement by resampling directly from the cld raster to a new raster with the same extent/resolution as r, using a nearest-neighbor sampling method, instead of crop followed by extract. It also ran on a smaller dataset, and with a much smaller memory footprint.
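The resampling code itself is missing from this copy of the answer; a sketch of what it describes, using the raster package's resample() with method = "ngb" (nearest neighbor, which preserves categorical land-use codes), might look like:

```r
# Crop the large CDL raster to the stack's extent, then resample it
# directly onto r's grid so every cell of r gets a land-use code,
# replacing the slow extract() call
clip1 <- crop(cld, extent(r))
landuse <- resample(clip1, r, method = "ngb")
mat$landuse <- getValues(landuse)
```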
2) Parallelization could get you a lot more

That will improve things significantly, but you can get a massive improvement if you can parallelize this. R comes with a couple of parallelization backends, and they all run through foreach. I assume you are going to either process mat in place or save it for later. Since it takes so much work to get that resampled data, let's just assume we'll save it for later; the most convenient form is probably a raster alongside the data_robin files.

Unfortunately, the Windows and Unix parallelization options differ: on Linux, use doMC; on Windows, use doSNOW. Assuming we employ 4 workers:

linux initialization:
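(The original snippet is missing here; presumably it registered a doMC backend, roughly:)

```r
library(doMC)
registerDoMC(cores = 4)  # register 4 forked workers as the foreach backend
```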
windows initialization:
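(Again the snippet is lost; a doSNOW equivalent would be something like:)

```r
library(doSNOW)
cl <- makeCluster(4)  # start 4 worker processes
registerDoSNOW(cl)    # register them as the foreach backend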
next:
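The main loop is also missing from this copy. A plausible sketch, assuming the 4000 stack filenames are held in a vector called files (a hypothetical name, as are path and the output filename scheme), would be:

```r
library(foreach)

# Process each stack in parallel; each worker crops and resamples the
# big raster onto that stack's grid and writes the result to disk.
# Swap %dopar% for %do% to debug serially.
foreach(fl = files, .packages = "raster") %dopar% {
  r <- stack(file.path(path, "data_robin", fl))
  clip1 <- crop(cld, extent(r))
  landuse <- resample(clip1, r, method = "ngb")
  writeRaster(landuse,
              filename = file.path(path, "data_robin",
                                   paste0("landuse_", fl)),
              overwrite = TRUE)
}
```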
One of the drawbacks of the parallel foreach is that it's hard to tell when something goes wrong. It would be good to run this serially first by replacing the %dopar% with %do% until you know it is working, and then let it run through the whole thing.

Caveats: In my simple example above (each raster had 1/100 the pixels of cld and r, respectively), I only gained an additional 30% by engaging 5 workers over doing it serially in a single process. I was also unable to parallelize the example with the full-size rasters without getting Error in { : task 1 failed - "cannot allocate vector of size xx.x Mb". I think conceptually this should work, but I wasn't able to get it running at the scale you are working at.