Warning using zonal_stats: RuntimeWarning: overflow encountered in reduce return

pythonzonal statistics

I have a grid of cells (100km x 100km) around the African continent (shapefile) . I am doing a zonal_stats with the population of each cell. It is a raster data (geotiff) with resolution 1kmx1km.

This is the code I am using:

import geopandas as gpd
from pyproj import CRS
import rasterio
from rasterstats import zonal_stats
from rasterio.mask import mask

pop_array, pop_transform = mask(pop, shapes=grid.geometry, crop=True)

zs1 = zonal_stats(grid, pop_array[0], affine=pop_transform, stats=['mean'], nodata=np.nan)
grid['pop'] = [x['mean'] for x in zs1]

where pop is my population raster file and the grid is my shapefile of cells. Both of them have the same crs. However, the following warning appears:

RuntimeWarning: overflow encountered in reduce return umr_sum(a, axis, dtype, out, keepdims, initial, where)

Some of the values of the resulting column "pop" has a value "-inf", and others have a very high negative value, which is surprising for me given that the raster is population count. I have looked at a possible problem and it could be that I need to specify a greater dtype. For example, with astype.

Given that, I have tried to add the following in the last line of my code:

grid['pop'] = [x['mean'] for x in zs1].astype(float128)

But it's still not working with the following error: "AttributeError: 'list' object has no attribute 'astype'"

Any idea about how can I solve it?

"Pop" is the name I assigned to the new column with the results from the zonal stats. It has dtype "float64". When I do pop_array.dtype(), it says to me the following:

TypeError: 'numpy.dtype[float32]' object is not callable

Best Answer

The source nodata value (nan in this case) needs to be defined or supplied as an arg to rasterio.mask.mask(), so in addition to masking out the geometry it also masks out the nodata value and doesn't attempt to include nodata in the calculations.


However I wouldn't expect it to ever overflow, so that's strange. If nans were being included in the calculation the result should be nan as well.

It's possible that somewhere along the line your data was being cast to an integer type. nan is essentially a special-case floating point value which it doesn't apply to integers, so naively treating nan as an int will give you wacky values. It seems like np.mean() is trying to sum all those wacky values and overflowing.

I'm not sure if this is a rasterstats bug or something funky with your specific dataset, like the raster dtype being incorrect or something. To properly debug that I'd probably need your actual input data and the exact versions of all the libraries you're using.

Related Question