[GIS] Converting GeoTIFF to np array

arraygdalgeotiff-tiffnumpypython

I am fairly new to GIS in general.

I am having some issues converting a GeoTIFF into an numpy array. There are multiple questions online about how to do it, I've tried, many, with the same result each time.

I have a GeoTIFF showing the repartition of a parasite in a country, which, when I open it (with rasterio.open) shows:

Basically what I'd like to do, is turning this heatmap into multiple polygons by different thresholds, which would turn into some sort of topological map. I would get polygons where the concentration of the parasite would be >0.8 for example, then where it is 0.8>concentration>0.5, etc. I thought the best way would be to convert the image into an array, and simply using np.where(), keep parts of the images with pixel values in between 2 thresholds.

However, when I try to do so, this is what I get:

Numpy turns everything outside of the boundaries of the GeoTIFF into -9999, which I don't mind. However, I lost all the information inside the boundaries, and I don't really know how to keep it. Here is the code used:

data = gdal.Open(PR_files[10])
img = data.GetRasterBand(1)
raster = data.ReadAsArray()
show(raster)

(Note that I have multiple images over the years, with PR_files being the list where I keep all the paths to these images, so I just fetch one as an example. My images only have 1 band by the way).

What I would ultimately like to do, is turn these shapes into a shapely file, and use these Polygons to do some LULC prediction. I have similar images for each year for the past 17 years, and I'd basically try to see if it would be possible to predict the 18th with this method. I have no idea if this would work, but that's why I want to give it a try. This is just a personal project I'm doing to mostly get more familiar with GIS in general.

Best Answer

I agree with @Devin Cairns. Most likely the situation is a difference between both visualizations rather than their underlying values. If you want to be sure that the values are different inside your area of interest, try the following code:

data = gdal.Open(PR_files[10])
img = data.GetRasterBand(1)
raster = img.ReadAsArray()
nd_value = img.GetNoDataValue()
unique_values = np.unique(raster[raster!=nd_value])
print(unique_values)

This will print all the values inside the are of interest because it ignores NoData values. If you get only one value, there is a problem with the data. If you get more values, you'll know that you are probably reading the array properly. If your numbers are continuous rather than discrete, you might be better of printing the array instead of computing unique values. That can be accomplished with the following line:

print(raster[raster!=nd_value])

Furthermore, it would be good to know what your show() functions looks like. Python plotting libraries such as matplotlib let you pass masked arrays (as Devin commented) and you can specify the color palettes.

Regarding the scope of your project, I think the most straightforward process would be:

1) Reclassify your raster so you get an integer value for each class. For example:

Integer value 1: concentrations above 0.8
Integer value 2: concentrations between 0.5 and 0.8
Integer value 3: concentrations below 0.5

2) Turn your raster to polygons. This will produce polygons for each different class (integer) you have.

There are plenty of questions on this site that deal with reclassifying a raster and vectorizing it (raster to polygon) using Python so I'll leave that to you.

Related Solutions

Write Numpy Array to Raster File – Converting Numpy Arrays to Raster Format

Below is an example that I wrote for a workshop that utilizes the numpy and gdal Python modules. It reads data from one .tif file into a numpy array, does a reclass of the values in the array and then writes it back out to a .tif.

From your explanation, it sounds like you might have succeeded in writing out a valid file, but you just need to symbolize it in QGIS. If I remember correctly, when you first add a raster, it often shows up all one color if you don't have a pre-existing color map.

import numpy, sys
from osgeo import gdal
from osgeo.gdalconst import *


# register all of the GDAL drivers
gdal.AllRegister()

# open the image
inDs = gdal.Open("c:/workshop/examples/raster_reclass/data/cropland_40.tif")
if inDs is None:
  print 'Could not open image file'
  sys.exit(1)

# read in the crop data and get info about it
band1 = inDs.GetRasterBand(1)
rows = inDs.RasterYSize
cols = inDs.RasterXSize

cropData = band1.ReadAsArray(0,0,cols,rows)

listAg = [1,5,6,22,23,24,41,42,28,37]
listNotAg = [111,195,141,181,121,122,190,62]

# create the output image
driver = inDs.GetDriver()
#print driver
outDs = driver.Create("c:/workshop/examples/raster_reclass/output/reclass_40.tif", cols, rows, 1, GDT_Int32)
if outDs is None:
    print 'Could not create reclass_40.tif'
    sys.exit(1)

outBand = outDs.GetRasterBand(1)
outData = numpy.zeros((rows,cols), numpy.int16)


for i in range(0, rows):
    for j in range(0, cols):

    if cropData[i,j] in listAg:
        outData[i,j] = 100
    elif cropData[i,j] in listNotAg:
        outData[i,j] = -100
    else:
        outData[i,j] = 0


# write the data
outBand.WriteArray(outData, 0, 0)

# flush data to disk, set the NoData value and calculate stats
outBand.FlushCache()
outBand.SetNoDataValue(-99)

# georeference the image and set the projection
outDs.SetGeoTransform(inDs.GetGeoTransform())
outDs.SetProjection(inDs.GetProjection())

del outData

[GIS] Memory and Processing Efficient Approach to Read a Timeseries in Python

There's no ultimate answer to this because there's no getting away from memory vs speed tradeoffs.

One way to get 4x "more" memory would be to convert your NDVI arrays from float to signed Int16. You'd still have 4 decimal places precision, and a lot of the float precision was false anyway since the source imagery would have been (I presume) 8 or 16 bit to begin with. Do your math in integer space and convert to floats at the end.

Another way to get more memory would be to distribute the problem over several machines. If your problem allows you to analyze the images one at a time, it should be distributable.

Also, don't overlook Rasterio, https://github.com/mapbox/rasterio, a newer and better Python GDAL library for reading and writing raster data.

Best Answer

Related Solutions

Write Numpy Array to Raster File – Converting Numpy Arrays to Raster Format

[GIS] Memory and Processing Efficient Approach to Read a Timeseries in Python

Related Question