[GIS] Memory and Processing Efficient Approach to Read a Timeseries in Python

gdal, imagery, numpy, python

I am trying to store a timeseries of high-resolution raster images containing NDVI values in a dictionary for calculations. I am curious which approach would be the most efficient when loading multiple large files into memory. I have tried two options so far, both rather inefficient.

  1. I would loop through the folder containing all of the images and use GDAL to read the values of the images into a 3D Numpy array.

    import os
    import numpy as np
    from osgeo import gdal

    # Pre-allocate a (rows, cols, time) stack of 32-bit floats.
    img_stack = np.empty((image_width, image_length, len(image_files)), np.float32)

    for i, fname in enumerate(image_files):
        ds = gdal.Open(os.path.join(fn, fname))  # fn is the image folder
        img_stack[:, :, i] = ds.ReadAsArray()
        ds = None  # release the GDAL dataset
    

    After the images are loaded into the array, a loop visits every single pixel location and retrieves that location's value in each image.

  2. I would do something similar to the previous approach, but instead of loading all of the images into img_stack first, I would go through the images one at a time. This reduces memory usage, but it results in a longer processing time.

Would there be a more efficient way to do this procedure, using Numpy or GDAL?

The structure of the dictionary is as follows:

z_stack = {x1: {y1: [z11a, z11b, z11c],
                y2: [z12a, z12b, z12c],
                y3: [z13a, z13b, z13c]},
           x2: {y1: [z21a, z21b, z21c],
                y2: [z22a, z22b, z22c],
                y3: [z23a, z23b, z23c]},
           x3: {y1: [z31a, z31b, z31c],
                y2: [z32a, z32b, z32c],
                y3: [z33a, z33b, z33c]}}

and the iteration to populate the dictionary is:

z_stack = {}
for x in range(image_width):
    z_stack[x] = {}
    for y in range(image_length):
        z_stack[x][y] = []
        for i in range(len(image_files)):
            z_stack[x][y].append(img_stack[x, y, i])
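
Equivalently, the same structure can be built from the img_stack array of the first snippet with a comprehension; a minimal sketch (the per-pixel timeseries is just a slice along the third axis):

    # Timeseries for one pixel: a 1D slice along the time axis,
    # for some pixel indices x, y.
    series = img_stack[x, y, :]  # shape: (len(image_files),)

    # Build the nested dictionary directly from the stack.
    z_stack = {x: {y: img_stack[x, y, :].tolist() for y in range(image_length)}
               for x in range(image_width)}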

Best Answer

There's no ultimate answer to this because there's no getting away from the memory-versus-speed tradeoff.

One way to get 4x "more" memory would be to convert your NDVI arrays from 64-bit float to signed Int16 (it's 2x if you start from 32-bit float, as in your snippet). Scale the values by 10,000 and you'd still have four decimal places of precision; a lot of the float precision was false anyway, since the source imagery would have been (I presume) 8- or 16-bit to begin with. Do your math in integer space and convert back to float at the end.
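
A minimal sketch of that idea, assuming NDVI values in [-1, 1] and a scale factor of 10,000 (the factor is an assumption, chosen so int16 keeps four decimal places):

    import numpy as np

    SCALE = 10_000  # assumed factor: NDVI in [-1, 1] * 1e4 fits comfortably in int16

    ndvi = np.random.uniform(-1.0, 1.0, size=(1024, 1024))  # stand-in NDVI tile
    ndvi_i16 = np.round(ndvi * SCALE).astype(np.int16)      # 2 bytes/pixel vs 8 for float64

    # Do the math in integer space, widening the accumulator so sums don't overflow.
    total = ndvi_i16.sum(dtype=np.int64)

    # Convert back to float only at the end.
    mean_ndvi = total / ndvi_i16.size / SCALE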

Another way to get more memory would be to distribute the problem over several machines. If your problem allows you to analyze the images one at a time, it should be distributable.
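
A single-machine version of that one-image-at-a-time idea can be sketched with a process pool (summarize and image_paths are placeholder names, not from the question):

    from multiprocessing import Pool

    import numpy as np
    from osgeo import gdal

    def summarize(path):
        """Placeholder per-image statistic; each worker holds one image at a time."""
        arr = gdal.Open(path).ReadAsArray()
        return float(np.nanmean(arr))

    if __name__ == "__main__":
        image_paths = [...]  # fill in with the ordered timeseries file paths
        with Pool() as pool:
            means = pool.map(summarize, image_paths)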

Also, don't overlook Rasterio (https://github.com/mapbox/rasterio), a newer Python library built on GDAL with a cleaner API for reading and writing raster data.
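
For instance, the stack from the question could be read with Rasterio roughly like this (a sketch; image_paths is assumed to be the ordered list of files):

    import numpy as np
    import rasterio

    # Size the (rows, cols, time) stack from the first image's metadata.
    with rasterio.open(image_paths[0]) as src:
        stack = np.empty((src.height, src.width, len(image_paths)), dtype=np.float32)

    # Read band 1 of each image into its time slice.
    for i, path in enumerate(image_paths):
        with rasterio.open(path) as src:
            stack[:, :, i] = src.read(1)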