Python – Efficiently Read Large Tif Raster to Numpy Array with GDAL

Tags: gdal, geotiff-tiff, numpy, python, raster

I have been using the Python GDAL API to read tif raster files as NumPy arrays. Previously, I simply read the raster into an array directly with GDAL:

from osgeo import gdal
import numpy as np

ds = gdal.Open('example.tif')
fullarray = np.array(ds.ReadAsArray())  # reads the entire raster into memory at once

However, with larger rasters, I receive a MemoryError. As a workaround, I have been looping over the raster and reading windows into a NumPy array:

from tqdm import tqdm

# arbitrarily chosen window size
rows, cols = 20000, 20000
arr = np.zeros((4, rows, cols))  # pre-allocated buffer; 4-band raster assumed
band = ds.GetRasterBand(1)
xsize, ysize = band.XSize, band.YSize
x_edge, y_edge = int(xsize - cols + 1), int(ysize - rows + 1)
x_extra, y_extra = int(x_edge % cols), int(y_edge % rows)  # edge remainders (not handled below)

for i in tqdm(range(0, x_edge, cols)):
    for j in range(0, y_edge, rows):

        # read each window into the pre-allocated array
        ds.ReadAsArray(i, j, cols, rows, arr)

I saw a significant speed-up (at least 2x) from using a pre-allocated array, but reading is still a major bottleneck in my code. Is there a better way to do this?

Best Answer

You can improve your I/O time by taking the blocksize of the underlying raster into account. "Blocksize" refers to the size of the chunks in which the raster is written to disk. Typical blocksizes are 128 x 128 or 256 x 256, but it is not uncommon for the blocksize to be N x 1, where N is the number of columns in the raster (or some other non-square shape). If that is your case, reading in block-aligned windows will dramatically speed up your reads into memory.
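For instance, you can query a band's blocksize directly through the GDAL API before choosing a read strategy (a minimal sketch; 'example.tif' is a placeholder path):

from osgeo import gdal

ds = gdal.Open('example.tif')
band = ds.GetRasterBand(1)

# GetBlockSize() returns [x_blocksize, y_blocksize]
x_block, y_block = band.GetBlockSize()
print(x_block, y_block)  # e.g. 256 256 for a tiled file, or <ncols> 1 for a striped one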

I had a very similar issue; here is some code I developed that tries to optimize your tile size based on the TIFF's blocksize. You want your tile size to be an integer multiple of the blocksize in both the x- and y-dimensions. Beyond that, I don't think there is much you can do unless you are willing and able to rewrite the original raster (e.g. retile it) or buy an SSD.
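Below is a minimal sketch of that block-aligned approach (the read_in_block_aligned_tiles helper and its blocks_per_tile multiplier are illustrative names, not part of GDAL; a single-band read is assumed):

from osgeo import gdal

def read_in_block_aligned_tiles(path, blocks_per_tile=16):
    """Yield raster tiles whose edges line up with the file's internal
    blocksize, so each block on disk is read exactly once."""
    ds = gdal.Open(path)
    band = ds.GetRasterBand(1)
    xsize, ysize = band.XSize, band.YSize

    # tile size = an integer multiple of the blocksize in each dimension
    x_block, y_block = band.GetBlockSize()
    tile_x, tile_y = x_block * blocks_per_tile, y_block * blocks_per_tile

    for yoff in range(0, ysize, tile_y):
        win_y = min(tile_y, ysize - yoff)      # clip the last row of tiles
        for xoff in range(0, xsize, tile_x):
            win_x = min(tile_x, xsize - xoff)  # clip the last column
            yield band.ReadAsArray(xoff, yoff, win_x, win_y)

for tile in read_in_block_aligned_tiles('example.tif'):
    pass  # process each tile here

Tune blocks_per_tile so each tile fits comfortably in memory; for an N x 1 (striped) layout this reads whole groups of scanlines at a time, which matches the on-disk access pattern.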