My aim is to read time series from a stack of spatially aligned GeoTIFF files with Python, as efficiently as possible. The time series is not limited to a single pixel; it can also relate to a region of interest delineated by a bounding box. To do so, I create a VRT file stacking all relevant GeoTIFF files in the right temporal order. Then I open the VRT file and extract the time series by specifying the pixel coordinates or bounding box of interest.
I tested this procedure on two systems:
1. Local Windows 10 PC with 4 physical cores and 32 GB RAM. Data is stored on an NTFS HDD.
2. CentOS 7 virtual machine on a cluster with 16 physical cores and 64 GB RAM. Data is stored on a distributed file system (I don't have more detailed information here).
When comparing the reading performance of both systems, system 2 is much slower, e.g. by a factor of 2-3.
Why doesn't VRT/GDAL use multiple cores to read data stored at different locations, as is the case on system 2?
Best Answer
From your question, you appear to want to read time series of individual pixels. The fastest way I have found for this is to convert the VRT to a GeoTIFF with the creation option "INTERLEAVE=BAND".
If that's not an option (because it takes a lot of disk space), you can also use a ThreadPool:
In your case, if you use the thread pool, it may well be faster to index the independent GeoTIFF files directly rather than going through the VRT.