Python – Efficiently Read Large Tif Raster to Numpy Array with GDAL

Tags: gdal, geotiff-tiff, numpy, python, raster

I have been using the Python GDAL API to read tif raster files as NumPy arrays. Previously, I simply read the raster into an array directly with GDAL:

from osgeo import gdal
import numpy as np

ds = gdal.Open('example.tif')
fullarray = np.array(ds.ReadAsArray())  # reads the entire raster into memory at once

However, with larger rasters, I receive a MemoryError. As a workaround, I have been looping over the raster and reading windows into a NumPy array:

from tqdm import tqdm

# arbitrarily chosen window size
rows, cols = 20000, 20000
arr = np.zeros((4, rows, cols))  # pre-allocated buffer; 4-band raster assumed
band = ds.GetRasterBand(1)
xsize, ysize = band.XSize, band.YSize
x_edge, y_edge = int(xsize - cols + 1), int(ysize - rows + 1)
x_extra, y_extra = int(x_edge % cols), int(y_edge % rows)  # edge remainders (not handled below)

for i in tqdm(range(0, x_edge, cols)):
    for j in range(0, y_edge, rows):

        # read each window into the pre-allocated array
        ds.ReadAsArray(i, j, cols, rows, arr)

I saw a significant speed-up (at least 2x) from using a pre-allocated array, but reading is still a major bottleneck in my code. Is there a better way to do this?

Best Answer

You can improve your I/O time by taking the blocksize of the underlying raster into account. "Blocksize" refers to the size of the chunks in which the raster is written to disk. Typical blocksizes are 128 x 128 or 256 x 256, but it is not uncommon for the blocksize to be N x 1, where N is the number of columns in the raster (or some other non-square shape). If that is your case, reading in block-aligned windows will dramatically speed up your reads into memory.
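For instance, you can query a band's blocksize directly through the GDAL API before choosing a read strategy (a minimal sketch; 'example.tif' is a placeholder path):

from osgeo import gdal

ds = gdal.Open('example.tif')
band = ds.GetRasterBand(1)

# GetBlockSize() returns [x_blocksize, y_blocksize]
x_block, y_block = band.GetBlockSize()
print(x_block, y_block)  # e.g. 256 256 for a tiled file, or <ncols> 1 for a striped one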

I had a very similar issue; here is some code I developed that tries to optimize your tile size based on the TIFF's blocksize. You want your tile size to be an integer multiple of the blocksize in both the x- and y-dimensions. Beyond that, I don't think there is much you can do unless you are willing and able to rewrite the original raster (e.g. retile it) or buy an SSD.
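Below is a minimal sketch of that block-aligned approach (the read_in_block_aligned_tiles helper and its blocks_per_tile multiplier are illustrative names, not part of GDAL; a single-band read is assumed):

from osgeo import gdal

def read_in_block_aligned_tiles(path, blocks_per_tile=16):
    """Yield raster tiles whose edges line up with the file's internal
    blocksize, so each block on disk is read exactly once."""
    ds = gdal.Open(path)
    band = ds.GetRasterBand(1)
    xsize, ysize = band.XSize, band.YSize

    # tile size = an integer multiple of the blocksize in each dimension
    x_block, y_block = band.GetBlockSize()
    tile_x, tile_y = x_block * blocks_per_tile, y_block * blocks_per_tile

    for yoff in range(0, ysize, tile_y):
        win_y = min(tile_y, ysize - yoff)      # clip the last row of tiles
        for xoff in range(0, xsize, tile_x):
            win_x = min(tile_x, xsize - xoff)  # clip the last column
            yield band.ReadAsArray(xoff, yoff, win_x, win_y)

for tile in read_in_block_aligned_tiles('example.tif'):
    pass  # process each tile here

Tune blocks_per_tile so each tile fits comfortably in memory; for an N x 1 (striped) layout this reads whole groups of scanlines at a time, which matches the on-disk access pattern.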