[GIS] Optimising a very large point in polygon query

point-in-polygon, postgis

I have a national dataset of address points (37 million) and a polygon dataset of flood outlines (2 million) of type MultiPolygonZ; some of the polygons are very complex, with a maximum ST_NPoints of around 200,000. Using PostGIS (2.18) I am trying to identify which address points fall within a flood polygon and write these to a new table with the address id and flood risk details. I started from the address perspective (ST_Within) but then swapped to starting from the flood area perspective (ST_Contains), the rationale being that there are large areas with no flood risk at all. Both datasets have been reprojected to 4326 and both tables have a spatial index. My query below has been running for 3 days now and shows no signs of finishing any time soon!

select a.id, f.risk_factor_1, f.risk_factor_2, f.risk_factor_3
into gb.addresses_with_flood_risk
from gb.flood_risk_areas f, gb.addresses a
where ST_Contains(f.the_geom, a.the_geom);

Is there a better way to run this query? Also, for long-running queries of this type, what is the best way of monitoring progress other than looking at resource utilisation and pg_stat_activity?
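For reference, a minimal pg_stat_activity check of the kind mentioned above; this is only a sketch, and the column names assume PostgreSQL 9.2 or later:

-- Show active backends, how long their current query has been running, and a preview of it.
SELECT pid,
       state,
       now() - query_start AS running_for,
       left(query, 80)     AS query_preview
FROM   pg_stat_activity
WHERE  state = 'active'
ORDER  BY running_for DESC;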


My original query did finish, albeit after 3 days, and I then got sidetracked with other work, so I never dedicated the time to trying out the solution. However, I have just revisited it and am working through the recommendations; so far so good. I have used the following approach:

  1. Created a 50km grid over the UK using the ST_FishNet solution suggested here
  2. Set the SRID of the generated grid to British National Grid and built a spatial index on it
  3. Clipped my flood data (MultiPolygon) using ST_Intersection and ST_Intersects (the only gotcha here was that I had to use ST_Force_2D on the geom, as shp2pgsql had added a Z dimension)
  4. Clipped my point data using the same grid
  5. Created indexes on the row and col columns and a spatial index on each of the tables (a sketch of steps 1, 2 and 5 follows this list)
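A minimal sketch of steps 1, 2 and 5 (building the grid, setting its SRID and indexing it), assuming the ST_CreateFishnet helper published on the PostGIS wiki; the schema, table names, cell counts and grid origin below are illustrative rather than the exact values I used:

-- Fishnet helper as published on the PostGIS wiki (included here so the sketch is self-contained).
CREATE OR REPLACE FUNCTION ST_CreateFishnet(
        nrow integer, ncol integer,
        xsize float8, ysize float8,
        x0 float8 DEFAULT 0, y0 float8 DEFAULT 0,
        OUT "row" integer, OUT col integer,
        OUT geom geometry)
    RETURNS SETOF record AS
$$
SELECT i + 1 AS "row", j + 1 AS col,
       ST_Translate(cell, j * $3 + $5, i * $4 + $6) AS geom
FROM generate_series(0, $1 - 1) AS i,
     generate_series(0, $2 - 1) AS j,
     (SELECT ('POLYGON((0 0, 0 '||$4||', '||$3||' '||$4||', '||$3||' 0, 0 0))')::geometry AS cell) AS t;
$$ LANGUAGE sql IMMUTABLE STRICT;

-- Step 1: a 50 km grid covering Great Britain (26 x 14 cells roughly spans the GB extent).
-- Step 2: tag the grid with British National Grid (EPSG:27700).
CREATE TABLE rofrse.gb_grid AS
SELECT "row", col, ST_SetSRID(geom, 27700) AS geom
FROM   ST_CreateFishnet(26, 14, 50000, 50000, 0, 0);

-- Steps 2 and 5: spatial index plus an index on row and col for the per-cell iteration.
CREATE INDEX gb_grid_geom_gix    ON rofrse.gb_grid USING GIST (geom);
CREATE INDEX gb_grid_row_col_ix  ON rofrse.gb_grid ("row", col);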

I am now ready to run my script, which will iterate over the rows and columns, populating results into a new table until I have covered the whole country. However, I have just checked my flood data and some of the very largest polygons appear to have been lost in translation! This is my query:

SELECT g.row, g.col, f.gid, f.objectid, f.prob_4band, ST_Intersection(ST_Force_2D(f.geom), g.geom) AS geom 
INTO rofrse.tmp_flood_risk_grid 
FROM rofrse.raw_flood_risk f, rofrse.gb_grid g
WHERE (ST_Intersects(ST_Force_2D(f.geom), g.geom));
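Step 4 (clipping the point data) follows the same pattern; a minimal sketch, assuming the address points have also been transformed to British National Grid and using illustrative table and column names:

-- Tag each address point with the grid cell it falls in, then index the result.
SELECT g."row", g.col, a.id, a.geom
INTO   rofrse.tmp_addresses_grid
FROM   rofrse.addresses a
JOIN   rofrse.gb_grid g ON ST_Intersects(a.geom, g.geom);

CREATE INDEX tmp_addresses_grid_row_col_ix ON rofrse.tmp_addresses_grid ("row", col);
CREATE INDEX tmp_addresses_grid_geom_gix   ON rofrse.tmp_addresses_grid USING GIST (geom);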

My original data looks like this:

Original flood data

However, post clipping, it looks like this:

Gridded flood data

This is an example of a "missing" polygon:

"Missing" polygon

Best Answer

To answer your last question first, see this post about the desirability of being able to monitor the progress of queries. The problem is difficult and would be compounded in a spatial query: knowing that 99% of the addresses had already been scanned for containment in a flood polygon (which you could get from the loop counter in the underlying table-scan implementation) would not necessarily help if the final 1% of addresses happen to intersect the flood polygon with the most points, while the previous 99% intersect some tiny area. This is one of the reasons why EXPLAIN can sometimes be unhelpful with spatial queries: it gives an indication of the rows that will be scanned but, for obvious reasons, does not take into account the complexity of the polygons (and hence a large proportion of the run time) of any intersects/intersection-type query.

A second problem is that if you look at something like

EXPLAIN 
SELECT COUNT(a.id) 
FROM sometable a, someothertable b
WHERE ST_Intersects(a.geom, b.geom)

you will see something like the following, after leaving out lots of detail:

_st_intersects(a.geom, b.geom)
   ->  Bitmap Index Scan on ix_spatial_index_name  (cost=... rows=... width=...)
         Index Cond: (a.geom && geom)

The final condition, &&, means: do a bounding-box check before doing any more accurate intersection of the actual geometries. This is obviously sensible and is at the core of how R-trees work. However (I have also worked on UK flood data in the past, so am familiar with the structure of the data), if the (Multi)Polygons are very extensive -- this problem is particularly acute if a river runs at, say, 45 degrees -- you get huge bounding boxes, which can force huge numbers of potential intersections to be checked against very complex polygons.
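One hedged way to see how much this matters for a given dataset (the table and column names here are taken from your query and are otherwise assumptions) is to compare each polygon's bounding-box area with the area of the polygon itself:

-- Find the polygons whose bounding box is most bloated relative to the geometry itself;
-- these are the ones that generate the most false candidates from the && check.
SELECT gid,
       ST_NPoints(geom)                                        AS npoints,
       ST_Area(geom)                                           AS poly_area,
       ST_Area(ST_Envelope(geom))                              AS bbox_area,
       ST_Area(ST_Envelope(geom)) / NULLIF(ST_Area(geom), 0)   AS bbox_bloat
FROM   rofrse.raw_flood_risk
ORDER  BY bbox_bloat DESC
LIMIT  20;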

The only solution I have been able to come up with for the "my query has been running for 3 days and I don't know if we are at 1% or 99%" problem is a kind of divide-and-conquer-for-dummies approach: break your area into smaller chunks and run those separately, either in a loop in plpgsql or explicitly in the console. This has the advantage of cutting complex polygons into parts, which means subsequent point-in-polygon checks are working on smaller polygons, and the polygons' bounding boxes are much smaller.
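A minimal sketch of the plpgsql variant, assuming the 50 km grid and the pre-clipped flood and address tables described in the question update, and that the results table already exists; all names are illustrative:

-- Iterate over the grid cells, doing the point-in-polygon test one cell at a time
-- and logging a NOTICE per cell as a crude progress monitor.
DO $$
DECLARE
    cell record;
BEGIN
    FOR cell IN SELECT "row", col FROM rofrse.gb_grid ORDER BY "row", col LOOP
        INSERT INTO rofrse.addresses_with_flood_risk (id, prob_4band)
        SELECT a.id, f.prob_4band
        FROM   rofrse.tmp_flood_risk_grid f
        JOIN   rofrse.tmp_addresses_grid a
               ON  a."row" = cell."row" AND a.col = cell.col
              AND  ST_Contains(f.geom, a.geom)
        WHERE  f."row" = cell."row" AND f.col = cell.col;

        RAISE NOTICE 'finished cell row %, col %', cell."row", cell.col;
    END LOOP;
END;
$$;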

I have managed to run queries in a day by breaking the UK into 50km by 50km blocks, after killing a query that had been running for over a week on the whole of the UK. As an aside, I hope your query above is a CREATE TABLE or an UPDATE and not just a SELECT. When you are updating one table (addresses) based on whether its rows fall in a flood polygon, the whole table being updated has to be scanned anyway, so having a spatial index on it is of no help at all.

EDIT: On the basis that an image is worth a thousand words, here is an image of some UK flood data. There is one very large multipolygon whose bounding box covers that whole area, so it is easy to see how, for example, by first intersecting the flood polygon with the red grid, the square in the southwest corner would suddenly only be tested against a tiny subset of the polygon.

UK flood data overlaid with the red grid
