Python – Efficiently Reading VRT GeoTIFF Image Stacks

Tags: gdal, python, time-series, vrt

My aim is to read time series with Python from a stack of spatially aligned GeoTIFF files as efficiently as possible. The time series is not limited to a single pixel; it can also cover a region of interest delineated by a bounding box. To do so, I create a VRT file stacking all relevant GeoTIFF files in the right order. I then open the VRT file and extract the time series by specifying the pixel coordinates or bounding box of interest.
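In code, that workflow looks roughly like this (file names and window coordinates are placeholders):

from osgeo import gdal

# Stack spatially aligned GeoTIFFs into a VRT, one band per file.
tif_files = ["t0.tif", "t1.tif", "t2.tif"]
vrt = gdal.BuildVRT("stack.vrt", tif_files, separate=True)
vrt = None  # close/flush the VRT to disk

# Read the time series for a window given in pixel coordinates:
# the result has shape (n_bands, ysize, xsize).
ds = gdal.Open("stack.vrt")
series = ds.ReadAsArray(xoff=100, yoff=200, xsize=10, ysize=10)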

I tested this procedure on two systems:

  1. Local Windows 10 PC with 4 physical cores and 32 GB RAM. Data is stored on an NTFS HDD.
  2. CentOS 7 virtual machine on a cluster with 16 physical cores and 64 GB RAM. Data is stored on a distributed file system (I don't have more detailed information here).

When comparing the reading performance on both systems, system 2 is much slower, roughly 2-3 times.

Why doesn't VRT/GDAL use multiple cores to read data stored at different locations (as is the case for system 2)?

Best Answer

From your question, you appear to want to read time series of individual pixels. The fastest way I have found for this is to convert the VRT to a GeoTIFF with the creation option "INTERLEAVE=BAND".
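For example, the conversion could be done along these lines (input and output names are placeholders):

from osgeo import gdal

# Materialise the VRT stack as a single band-interleaved GeoTIFF.
gdal.Translate("stack_bsq.tif", "stack.vrt",
               creationOptions=["INTERLEAVE=BAND", "TILED=YES"])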

If that's not an option (because it takes loads of disk space), you can also use a ThreadPool:

from concurrent.futures import ThreadPoolExecutor
from functools import partial
from osgeo import gdal

def read_pxl(band, x, y, g):
    # Read a single pixel from one band of the open dataset.
    return g.GetRasterBand(band).ReadAsArray(xoff=x, yoff=y,
                                             win_xsize=1, win_ysize=1)

g = gdal.Open(my_super_duper_fname)  # path to the VRT stack
n_bands = g.RasterCount

# Fix the dataset and pixel location, leaving only the band index free.
wrapper = partial(read_pxl, g=g, x=1800, y=600)

# Read every band (i.e. every time step) of the pixel with a pool of threads.
with ThreadPoolExecutor(max_workers=20) as executor:
    retval = list(executor.map(wrapper, range(1, n_bands + 1)))

In your case, if you use the thread pool, it may be faster to read the individual GeoTIFF files directly rather than going through the VRT.
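A sketch of that variant, assuming each date is a single-band GeoTIFF listed in a placeholder tif_files list; every thread opens its own dataset, so no GDAL handle is shared between threads:

from concurrent.futures import ThreadPoolExecutor
from osgeo import gdal

def read_pxl_from_file(fname, x=1800, y=600):
    # Each thread opens its own dataset and reads one pixel from band 1.
    ds = gdal.Open(fname)
    return ds.GetRasterBand(1).ReadAsArray(xoff=x, yoff=y,
                                           win_xsize=1, win_ysize=1)

tif_files = ["t0.tif", "t1.tif", "t2.tif"]  # placeholder file list, in time order
with ThreadPoolExecutor(max_workers=20) as executor:
    series = list(executor.map(read_pxl_from_file, tif_files))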