My aim is to read time series from a stack of spatially aligned GeoTIFF files with Python, as efficiently as possible. The time series is not limited to a single pixel; it can also relate to a region of interest delineated by a bounding box. To do so, I create a VRT file stacking all relevant GeoTIFF files in the right temporal order. Then I open the VRT file and extract the time series by specifying the pixel coordinates or bounding box of interest.
I tested this procedure on two systems:
1. Local Windows 10 PC with 4 physical cores and 32 GB RAM. Data is stored on an NTFS HDD.
2. CentOS 7 virtual machine on a cluster with 16 physical cores and 64 GB RAM. Data is stored on a distributed file system (I don't have more detailed information here).
When comparing the reading performance of both systems, system 2 is much slower, e.g. by a factor of 2-3.
Why doesn't VRT/GDAL use multiple cores to read data stored at different locations, as is the case on system 2?
Best Answer
From your question, you appear to want to read time series of individual pixels. The fastest way I have found for this is to convert the VRT to a GeoTIFF with the creation option "INTERLEAVE=BAND".
If that's not an option (because it takes a lot of disk space), you can also use a ThreadPool:
In your case, if you use the thread pool, it may well be faster to index the independent GeoTIFF files directly rather than going through the VRT.