[GIS] Performance in calculating raster statistics in PostGIS

intersectionpostgispostgresql

I'm trying to calculate raster statistics (min, max, mean) for each polygon in a vector layer using PostgreSQL/PostGIS.

This GIS.SE answer describes how to do this, by calculating the intersection between the polygon and the raster and then calculating a weighted average: https://gis.stackexchange.com/a/19858/12420

I'm using the following query (where dem is my raster, topo_area_su_region is my vector, and toid is a unique ID:

SELECT toid, Min((gv).val) As MinElevation, Max((gv).val) As MaxElevation, Sum(ST_Area((gv).geom) * (gv).val) / Sum(ST_Area((gv).geom)) as MeanElevation FROM (SELECT toid, ST_Intersection(rast, geom) AS gv FROM topo_area_su_region,dem WHERE ST_Intersects(rast, geom)) foo GROUP BY toid ORDER BY toid;

This works, but it's too slow. My vector layer has 2489k features, with each one taking around 90ms to process – it would take days to process the entire layer. The speed of the calculation doesn't seem to be significantly improved if I only calculate the min and max (which avoids the calls to ST_Area).

If I do a similar calculation using Python (GDAL, NumPy and PIL) I can significantly reduce the amount of time it takes to process the data, if instead of vectorizing the raster (using ST_Intersection) I rasterize the vector. See code here: https://gist.github.com/snorfalorpagus/7320167

I don't really need a weighted average – a "if it touches, it's in" approach is good enough – and I'm reasonably sure this is what is slowing things down.

Question: Is there any way to get PostGIS to behave like this? i.e. to return the values of all the cells from the raster that a polygon touches, rather than the exact intersection.

I'm very new to PostgreSQL/PostGIS, so maybe there is something else I'm not doing right. I'm running PostgreSQL 9.3.1 and PostGIS 2.1 on Windows 7 (2.9GHz i7, 8GB RAM) and have tweaked the database config as suggested here: http://postgis.net/workshops/postgis-intro/tuning.html

enter image description here

Best Answer

You're right, using ST_Intersection slows down your query noticeable.

Instead of using ST_Intersection it is better to clip (ST_Clip) your raster with the polygons (your fields) and dump the result as polygons (ST_DumpAsPolygons). So every raster cell will be converted into a little polygon rectangle with distinct values.

For receiving min, max or mean from the dumps you can use the same statements.

This query should do the trick:

SELECT 
    toid,
    Min((gv).val) As MinElevation,
    Max((gv).val) As MaxElevation,
    Sum(ST_Area((gv).geom) * (gv).val) / Sum(ST_Area((gv).geom)) as MeanElevation
FROM (
    SELECT 
        toid,
        ST_DumpAsPolygons(ST_Clip(rast, 1, geom, true)) AS gv
    FROM topo_area_su_region,dem 
        WHERE ST_Intersects(rast, geom)) AS foo 
            GROUP BY toid 
            ORDER BY toid;

In the statement ST_Clip you define the raster, the raster band (=1), the polygon and if the crop should be TRUE or FALSE.

Besides you can use avg((gv).val) to calculate the mean value.

EDIT

The result of your approach is the more exact, but the slower one. The results of the combination of ST_Clip and ST_DumpAsPolygons are ignoring the raster cells that are intersecting with less than 50% (or 51%) of their size.

These two screen shots from a CORINE Land Use intersection show the difference. First picture with ST_Intersection, second one with ST_Clip and ST_DumpAsPolygons.

enter image description here

enter image description here

Related Question