PostGIS – Query for Selecting Number of Points per Polygon and Year

Tags: optimization, point-in-polygon, postgis, postgresql, sql

I have two tables in a PostGIS database:

rides, containing about 3 million rows:

ride_uid (unique integer)
start_geom (point geometry with index in 4326)
end_geom (point geometry with index in 4326)
start_date_year (integer)
end_date_year (integer)
a lot more stuff

This rides table also contains some invalid geometries, namely NULL values in start_geom or end_geom, or bogus points lying somewhere near the North Pole or in the Sahara desert. So I check for that using where ST_Within(st.start_geom,ST_MakeEnvelope(5, 35, 20, 55, 4326)) to prevent a reprojection error.
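To see how many rows such an envelope check would drop, the NULL and out-of-envelope geometries can be counted first. This is only a sketch: it reuses the table and column names described above, and the output column names are made up for illustration.

```sql
-- Count rides whose start or end point is missing, or lies outside
-- the envelope (5, 35) to (20, 55) in EPSG:4326.
-- NOT ST_Within(NULL, ...) yields NULL, so NULL geometries are only
-- counted in the null_* columns, not in the *_outside columns.
SELECT
    COUNT(*) FILTER (WHERE start_geom IS NULL) AS null_starts,
    COUNT(*) FILTER (WHERE end_geom IS NULL)   AS null_ends,
    COUNT(*) FILTER (
        WHERE NOT ST_Within(start_geom, ST_MakeEnvelope(5, 35, 20, 55, 4326))
    ) AS starts_outside,
    COUNT(*) FILTER (
        WHERE NOT ST_Within(end_geom, ST_MakeEnvelope(5, 35, 20, 55, 4326))
    ) AS ends_outside
FROM schema_a.rides;
```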

polygons, containing about 30 rows:

uid_key (string)
name (string)
geometry (multipolygons with index in 25832)
a lot more stuff

I want to count the starts and ends for every polygon and each year in a given range. Polygon/year combinations with 0 rides should also appear. For example, as result:

PolygonUID	Year	N_Starts	N_Ends
A	2015	55	33
A	2016	0	22
A	2017	1001	0
B	2015	63	12
B	2016	10	666
B	2017	0	0

I am struggling to find an efficient query that achieves this in one go. This is what I have so far; it returns my desired result for only starts or only ends in about 4-5 minutes (which is acceptable).

select
hlpr.uid_key,
hlpr.name,
hlpr.y,
coalesce(count(s.ride_uid),0) as cnt_start --,
--coalesce(count(e.ride_uid),0) as cnt_end -- takes forever with this at the same time
from
(
    select
    poly.uid_key,
    poly.name,
    cal.y,
    poly.geom
    from schema_b.polygons as poly
    cross join
        (
            select cast(generate_series(2015,2021) as integer) as y
        ) as cal
) as hlpr

left join
(
    select
    st.ride_uid,
    st.start_date_year,
    st.start_geom,
    pol.uid_key
    from schema_a.rides as st
    left join
    (
        select
        uid_key,
        geom
        from schema_b.polygons
    ) as pol
    on ST_Intersects(ST_Transform(st.start_geom,25832),pol.geom)
    where ST_Within(st.start_geom,ST_MakeEnvelope(5, 35, 20, 55, 4326))
) as s
on s.start_date_year = hlpr.y and s.uid_key = hlpr.uid_key

-- takes forever with this:
/*
left join
(
    select
    en.ride_uid,
    en.end_date_year,
    en.end_geom,
    pol.uid_key
    from schema_a.rides as en
    left join
    (
        select
        uid_key,
        geom
        from schema_b.polygons
    ) as pol
    on ST_Intersects(ST_Transform(en.end_geom,25832),pol.geom)
    where ST_Within(en.end_geom,ST_MakeEnvelope(5, 35, 20, 55, 4326))
) as e
on e.end_date_year = hlpr.y and e.uid_key = hlpr.uid_key
*/

group by
hlpr.uid_key,
hlpr.name,
hlpr.y

order by
hlpr.uid_key,
hlpr.y

Can I optimize this query somehow to get my desired result faster in one go? Or is this an unusual approach that should be avoided altogether, with only one count done per query and the results merged manually afterwards?

Best Answer

Rather than a series of set-wise JOINs (which necessarily create extremely inefficient cartesian products), use a correlated expression during a traversal of your polygons table. That way you can fully benefit from the indexes on both rides.start_geom and rides.end_geom.

If possible, I favor a LATERAL expression:

SELECT  ply.uid_key AS "PolygonUID",
        _y AS "Year",
        st._cnt AS "N_Starts",
        et._cnt AS "N_Ends"
FROM    polygons AS ply
CROSS JOIN
        GENERATE_SERIES(2015, 2021) AS _y
CROSS JOIN LATERAL (
    SELECT COUNT(*) AS _cnt
    FROM   rides AS t
    WHERE  ST_Intersects(ST_Transform(ply.geom, 4326), t.start_geom)
      AND  t.start_date_year = _y
) AS st
CROSS JOIN LATERAL (
    SELECT COUNT(*) AS _cnt
    FROM   rides AS t
    WHERE  ST_Intersects(ST_Transform(ply.geom, 4326), t.end_geom)
      AND  t.end_date_year = _y
) AS et
;

The query above is written in full verbose mode; every CROSS JOIN can literally be replaced by a comma (,).

For each row in polygons:

we create a cross product with the desired range of years (GENERATE_SERIES), multiplying the row count of the set,
from which we take each row's ply.geom and the cross-joined year _y and pass them to both LATERAL subqueries,
each of which runs a highly performant COUNT on the rides table, utilizing the spatial index on the respective geometry column.

Note that the key here is to pass each traversed ply.geom into the indexes on rides, which means you need to transform ply.geom to match the SRID of both rides geometries. Your query transforms the ride points instead, and not being able to utilize the spatial indexes on rides because of that transformation is a major issue. Since the ride points are no longer reprojected, there is also no need to run a pre-filter by envelope.
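Whether the spatial index is actually picked up can be checked with EXPLAIN on a single polygon and year. The following is a sketch under assumptions: the uid_key value 'A' and the year 2015 are just the sample values from the question, and the exact plan output depends on your PostgreSQL/PostGIS versions and table statistics. Note that EXPLAIN (ANALYZE) really executes the query.

```sql
-- Inspect the plan for one polygon and one year.
-- If the index is used, the plan should contain an index scan on the
-- spatial index of rides.start_geom with a bounding box (&&) condition.
EXPLAIN (ANALYZE, BUFFERS)
SELECT COUNT(*)
FROM   polygons AS ply
CROSS JOIN LATERAL (
    SELECT 1
    FROM   rides AS t
    WHERE  ST_Intersects(ST_Transform(ply.geom, 4326), t.start_geom)
      AND  t.start_date_year = 2015
) AS hit
WHERE  ply.uid_key = 'A';
```

A sequential scan on rides in that plan would suggest the geometry column is being wrapped in a function somewhere or that the index is missing.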

This is effectively equivalent to using an actual correlated sub-query per value set (start & end) per row:

SELECT  ply.uid_key AS "PolygonUID",
        _y AS "Year",
        (
          SELECT COUNT(*) AS _cnt
          FROM   rides AS t
          WHERE  ST_Intersects(ST_Transform(ply.geom, 4326), t.start_geom)
            AND  t.start_date_year = _y
        ) AS "N_Starts",
        (
          SELECT COUNT(*) AS _cnt
          FROM   rides AS t
          WHERE  ST_Intersects(ST_Transform(ply.geom, 4326), t.end_geom)
            AND  t.end_date_year = _y
        ) AS "N_Ends"
FROM    polygons AS ply
CROSS JOIN
        GENERATE_SERIES(2015, 2021) AS _y
;

but these two concepts may use fundamentally different query plans, with different benefits based on use case. Here, this latter query should be slightly slower.

Related Solutions

[GIS] Optimising a very large point in polygon query

To answer your last question first, see this post about the desirability of being able to monitor the progress of queries. The problem is difficult in general and is compounded in a spatial query. Knowing that 99% of the addresses have already been scanned for containment in a flood polygon (which you could get from the loop counter in the underlying table scan implementation) would not necessarily help if the final 1% of addresses happen to intersect the flood polygon with the most points, while the previous 99% intersect some tiny area. This is one of the reasons why EXPLAIN can sometimes be unhelpful with spatial queries: it gives an indication of the rows that will be scanned but, for obvious reasons, does not take into account the complexity of the polygons, which accounts for a large proportion of the run time of any intersects/intersection type query.

A second problem is that if you look at something like

EXPLAIN 
SELECT COUNT(a.id) 
FROM sometable a, someothertable b
WHERE ST_Intersects (a.geom, b.geom)

you will see something like the following, with many details omitted:

_st_intersects(a.geom, b.geom)
   ->  Bitmap Index Scan on ix_spatial_index_name  (cost...rows...width...))
   Index Cond: (a.geom && geom)

The final condition, &&, means that a bounding box check is done before any more accurate intersection of the actual geometries. This is obviously sensible and at the core of how R-Trees work. However, if the (Multi)Polygons are very extensive, you get huge bounding boxes, which might force huge numbers of potential intersections to be checked on very complex polygons. The problem is particularly acute if a river runs at, say, 45 degrees. I have worked on UK flood data in the past, so I am familiar with the structure of the data.
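One way to gauge how much work the bounding box check lets through is to compare, per polygon, the number of bounding box candidates with the number of true intersections. This is a rough sketch using the placeholder tables from the EXPLAIN example; the b.id column is an assumption, and on large data the query itself will be slow because it performs the exact tests it measures.

```sql
-- For each polygon in someothertable, count how many geometries pass
-- the bounding box test (&&) and how many of those truly intersect.
SELECT b.id,
       COUNT(*) AS bbox_candidates,
       COUNT(*) FILTER (WHERE ST_Intersects(a.geom, b.geom)) AS true_hits
FROM   someothertable AS b
JOIN   sometable AS a
  ON   a.geom && b.geom
GROUP  BY b.id
ORDER  BY bbox_candidates DESC;
```

Polygons with many candidates but few true hits are the ones whose bounding boxes are much larger than their actual footprint.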

The only solution I have been able to come up with for the "my query has been running for 3 days and I don't know if we are at 1% or 99%" problem is to use a kind of divide and conquer for dummies approach, by which I mean, break your area into smaller chunks, and run those separately, either in a loop in plpgsql or explicitly in the console. This has the advantage of cutting complex polygons into parts, which means subsequent point in polygon checks are working on smaller polygons and the polygons' bounding boxes are much smaller.
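The chunking idea could be sketched roughly as follows. Everything named here is hypothetical: a flood_polygons table with id and geom columns, geometries stored in British National Grid (EPSG:27700, units in metres), and a grid extent of 0 to 700 km east and 0 to 1300 km north. Adjust all of these to the real data.

```sql
-- Cut every flood polygon along a 50 km x 50 km grid, so that each
-- piece has a small bounding box and fewer vertices to test against.
CREATE TABLE flood_tiles AS
SELECT f.id AS flood_id,
       g.x  AS tile_x,
       g.y  AS tile_y,
       ST_Intersection(f.geom, g.cell) AS geom
FROM   flood_polygons AS f
JOIN  (
    -- One square cell per grid position (lower-left corner x, y).
    SELECT x, y,
           ST_MakeEnvelope(x, y, x + 50000, y + 50000, 27700) AS cell
    FROM   generate_series(0, 650000, 50000) AS x
    CROSS JOIN generate_series(0, 1250000, 50000) AS y
) AS g
  ON   ST_Intersects(f.geom, g.cell);

-- Index the pieces so later point-in-polygon checks can use them.
CREATE INDEX flood_tiles_geom_idx ON flood_tiles USING GIST (geom);
```

Later queries can then be run against flood_tiles one tile at a time by filtering on tile_x and tile_y. Pieces that only touch a cell boundary may come back as lines or points and may need filtering out. PostGIS also provides ST_Subdivide, which splits geometries by vertex count rather than along a fixed grid and may be a simpler alternative.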

I have managed to run queries in a day by breaking the UK into 50km by 50km blocks, after killing a query that had been running for over a week on the whole UK. As an aside, I hope your query above is a CREATE TABLE or UPDATE and not just a SELECT. When you are updating one table (addresses) based on being in a flood polygon, you have to scan the whole of that table anyway, so a spatial index on it is of no help at all.
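As an illustration of that last point, an update of this kind might look like the sketch below. The in_flood column and the flood_polygons table (with a geom column) are assumed names, not taken from the original question.

```sql
-- Every row of addresses is visited once; for each address the lookup
-- goes into flood_polygons, so the spatial index that can help here is
-- the one on flood_polygons.geom, not one on addresses.
UPDATE addresses AS a
SET    in_flood = EXISTS (
           SELECT 1
           FROM   flood_polygons AS f
           WHERE  ST_Intersects(f.geom, a.geom)
       );
```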

EDIT: On the basis that an image is worth a thousand words, here is an image of some UK flood data. There is one very large multipolygon, the bounding box of which covers that whole area, so it is easy to see how, for example, by first intersecting the flood polygon with the red grid, the square in the southwest corner would suddenly only be tested against a tiny subset of the polygon.

[GIS] Transformation of geometry on select Postgresql

First off:

You can chain most functions directly if their return types are compatible, so no need to subselect within a function
Geometries in textual representation are merely a tool for readability, so no need to have them translated in intermediate steps

Your understanding of the SQL syntax is...problematic, as is the notation ,). Read up on general SQL query structure; w3schools, for example, offers a quite comprehensive introduction

But, to help you out here; if I assume correctly that you simply want all geometries in table foo transformed into the given projection, and the original geometries are correctly referenced to a CRS, try

SELECT ST_Transform(geom, 32736) AS utm_geom
FROM foo;

to get the binary representation of your transformed geometries (standard format used for storage and geometric analysis within PostGIS) for further tasks, or

SELECT ST_AsText(ST_Transform(geom, 32736)) AS utm_geom
FROM foo;

to get a simple textual representation of your geometries.

If you'd like to create a new table with those geometries, use

CREATE TABLE bar AS
  SELECT ST_Transform(geom, 32736) AS utm_geom
  FROM foo;

(I need to add here: creating tables like above is trivial, maintaining performant database/table structures is not. Among a lot of other things, as one of the next steps in learning PostgreSQL/PostGIS I recommend reading about indexes, table statistics, and performance in general)
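As a minimal starting point for the new table, a spatial index and fresh statistics could be added like this. The index name is arbitrary, and whether the index pays off depends on how the table is queried.

```sql
-- GiST index on the transformed geometry column of the new table.
CREATE INDEX bar_utm_geom_idx ON bar USING GIST (utm_geom);

-- Refresh planner statistics so the new table and index are costed sensibly.
ANALYZE bar;
```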

Filters are most commonly added in the WHERE block, like

SELECT ST_Transform(geom, 32736) AS utm_geom
FROM foo
WHERE ST_IsValid(geom);       --Note: this is equal to 'WHERE ST_IsValid(geom) = TRUE'

to select only those rows/geometries of table foo that are well formed.

Note: in all cases, these queries will only return your geometries, no other column will be added to either the output or the table. To add more columns, name them in the SELECT command, possibly before the geometry, like

SELECT <column_a> AS <new_name_for_a_if_needed>,    --repeat for other columns if necessary
       ST_Transform(geom, 32736) AS utm_geom
FROM foo;
