[GIS] How to efficiently extract time series from a very large multi-band GeoTIFF in R

Tags: large-datasets, r, raster, rgdal, time-series

First, some context: I want to run a time-series analysis function that is only available in R on a continental-sized, multi-band time-series raster. The raster is stored on disk in GeoTIFF format and was created by translating a VRT file to a GeoTIFF with GDAL using the creation option INTERLEAVE=PIXEL (that option should store all band values for each pixel contiguously on disk, making them easy and fast to read).
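
The conversion step looked roughly like the following when run from R via system2 (the file names are illustrative and gdal_translate must be available on the PATH):

# Convert the VRT mosaic to a pixel-interleaved GeoTIFF (illustrative file names).
# The INTERLEAVE=PIXEL creation option stores all band values of a pixel
# contiguously on disk, which favours per-pixel (time-series) reads.
system2("gdal_translate", args = c(
  "-of", "GTiff",
  "-co", "INTERLEAVE=PIXEL",
  "mosaic.vrt",
  "mosaic_pixel_interleaved.tif"
))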

The function only needs the time series to work, so there is no spatial dependency. The file has 380 bands, each of continental extent, adding up to 420 GB of disk space.

The problem: I used the raster package to read the file as a stack, but when I try to get the values for a single pixel, R just hangs and appears to be processing, even though this operation should be fast. I tried the [] notation and the extract function, and neither worked. (Even when I read only 3 bands from a small RGB file, R seems slow.)
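
The kind of code that hangs for me looks roughly like this (the file name is illustrative; both indexing styles behave the same way):

library(raster)

modis_stack <- stack("../data/pa_br_mod13q1_ndvi_250_2000_2016.tif")

# Both of these appear to run forever on the 380-band file:
ts1 <- modis_stack[7000, 7000]                                        # [] notation (row, column)
ts2 <- extract(modis_stack, cellFromRowCol(modis_stack, 7000, 7000))  # extract by cell number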

Using gdallocationinfo is extremely fast, and with Python and the GDAL module it is also incredibly fast.
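
For comparison, this is roughly how I query the same pixel outside the raster package, calling gdallocationinfo from R (GDAL's binaries must be on the PATH; -valonly prints one value per band):

# Query all band values for one pixel (illustrative file name;
# the last two arguments are the pixel/column and line/row offsets).
vals <- system2("gdallocationinfo",
                args = c("-valonly",
                         "../data/pa_br_mod13q1_ndvi_250_2000_2016.tif",
                         "7000", "7000"),
                stdout = TRUE)
ndvi_ts <- as.numeric(vals)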

So shouldn't it be fast in R too? Am I doing something wrong, or am I missing a function designed for this? Doesn't the raster package use rgdal internally?

But most importantly, is there a way to do it using rgdal then?

Best Answer

I've tried things out a bit and found that:

  1. RasterBrick objects, as mentioned in RobertH's answer, do work and are the more user-friendly, easier option;
  2. rgdal functions such as readGDAL also work, but they take more parameters and are a little less user-friendly.

So which option should one use?

According to my tests (on my 420 GB GeoTIFF with dimensions of 18660 x 21592 and 374 bands), rgdal is faster, perhaps because it carries less overhead than a higher-level library such as the raster package.

Here are my results, using system.time and replicate:

With brick:

> modis_ndvi_ts_brick <- brick("../data/pa_br_mod13q1_ndvi_250_2000_2016.tif")
> system.time(replicate(100, modis_ndvi_ts_brick[7000,7000]))
   user  system elapsed 
 94.024   5.468  99.562

While with rgdal:

> system.time(replicate(100, readGDAL("../data/pa_br_mod13q1_ndvi_250_2000_2016.tif", offset=c(7000,7000), region.dim=c(1,1))))
#some printed output from readGDAL here
   user  system elapsed 
 88.752   5.400  94.213 

So, as you can see, rgdal is slightly faster. If the difference does not matter to you, I recommend using RasterBrick; it is simpler. But if, like me, you are trying to write high-performance code in which every millisecond matters: use rgdal.
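
If you go the rgdal route, a small wrapper keeps the call readable. This is just a sketch based on the readGDAL call above (offset is given as c(row, col), counted from the top-left corner):

library(rgdal)

# Read all band values for a single pixel and return them as a numeric vector.
read_pixel_ts <- function(path, row, col) {
  px <- readGDAL(path, offset = c(row, col), region.dim = c(1, 1), silent = TRUE)
  unlist(px@data[1, ], use.names = FALSE)  # the data slot has one column per band
}

ndvi_ts <- read_pixel_ts("../data/pa_br_mod13q1_ndvi_250_2000_2016.tif", 7000, 7000)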


Note: I'm sorry the data are not exactly reproducible, but you could try this out with other data and post here if your results differ.