PostGIS – Query for Selecting Number of Points per Polygon and Year

optimizationpoint-in-polygonpostgispostgresqlsql

I have two tables in a PostGIS database:

rides, containing about 3 million datasets:

  • ride_uid (unique integer)
  • start_geom (point geometry with index in 4326)
  • end_geom (point geometry with index in 4326)
  • start_date_year (integer)
  • end_date_year (integer)
  • a lot more stuff

This rides table also has some invalid geometries in it, namely NULL values in start_geom or end_geom or non-valid geometries laying somewhere near north pole or sahara dessert. So I'd check for that using where ST_Within(st.start_geom,ST_MakeEnvelope(5, 35, 20, 55, 4326)) to prevent a reprojection error.

polygons, containing about 30 datasets:

  • uid_key (string)

  • name (string)

  • geometry (multipolygons with index in 25832)

  • a lot more stuff

I want to figure out the start's and end's for all polygon's and a given range of years. I also want to consider 0 rides in a year/polygon. E.g. as result:

PolygonUID Year N_Starts N_Ends
A 2015 55 33
A 2016 0 22
A 2017 1001 0
B 2015 63 12
B 2016 10 666
B 2017 0 0

I am struggling to find an efficient query to achieve my goal in one go. This is what I got so far; It returns my desired result for only starts or only ends in about ~4-5 minutes (which is acceptable).

select
hlpr.uid_key,
hlpr.name,
hlpr.y,
coalesce(count(s.ride_uid),0) as cnt_start --,
--coalesce(count(e.ride_uid),0) as cnt_end -- takes forever with this at the same time
from
(
    select
    poly.uid_key,
    poly.name,
    cal.y,
    poly.geom
    from schema_b.polygons as poly
    cross join
        (
            select cast(generate_series(2015,2021) as integer) as y
        ) as cal
) as hlpr

left join
(
    select
    st.ride_uid,
    st.start_date_year,
    st.start_geom,
    pol.uid_key
    from schema_a.rides as st
    left join
    (
        select
        uid_key,
        geom
        from schema_b.polygons
    ) as pol
    on ST_Intersects(ST_Transform(st.start_geom,25832),pol.geom)
    where ST_Within(st.start_geom,ST_MakeEnvelope(5, 35, 20, 55, 4326))
) as s
on s.start_date_year = hlpr.y and s.uid_key = hlpr.uid_key

-- takes forever with this:
/*
left join
(
    select
    en.ride_uid,
    en.end_date_year,
    en.end_geom,
    pol.uid_key
    from schema_a.rides as en
    left join
    (
        select
        uid_key,
        geom
        from schema_b.polygons
    ) as pol
    on ST_Intersects(ST_Transform(en.end_geom,25832),pol.geom)
    where ST_Within(en.end_geom,ST_MakeEnvelope(5, 35, 20, 55, 4326))
) as e
on e.end_date_year = hlpr.y and e.uid_key = hlpr.uid_key
*/

group by
hlpr.uid_key,
hlpr.name,
hlpr.y

order by
hlpr.uid_key,
hlpr.y

Can I optimize this query somehow to get my desired result faster in one go? Or is this an unusual approach that should be avoided at all and instead only one count be done per query and the results merged manually afterwards?

Best Answer

Rather than a series of set-wise JOINs (that necessarily creates extremely inefficient cartesian products), use a correlated expression during a traversal of your polygons table - that way you can fully benefit from indexes on both rides.start_geom & rides.end_geom.

If possible, I favor a LATERAL expression:

SELECT  ply.uid_key AS "PolygonUID",
        _y AS "Year",
        st._cnt AS "N_Starts",
        et._cnt AS "N_Ends"
FROM    polygons AS ply
CROSS JOIN
        GENERATE_SERIES(2015, 2021) AS _y
CROSS JOIN LATERAL (
    SELECT COUNT(*) AS _cnt
    FROM   rides AS t
    WHERE  ST_Intersects(ST_Transform(ply.geom, 4326), t.start_geom)
      AND  t.start_date_year = _y
) AS st
CROSS JOIN LATERAL (
    SELECT COUNT(*) AS _cnt
    FROM   rides AS t
    WHERE  ST_Intersects(ST_Transform(ply.geom, 4326), t.end_geom)
      AND  t.end_date_year = _y
) AS et
;

Here in full verbose mode - all CROSS JOINs can literally be replaced by a ,.

For each row in polygons

  • we create a cross product with the desired range of years (GENERATE_SERIES), virtually increasing row count in the set
  • from which we then take the initial rows polygon.geom and the cross joined year _y, and pass those to both LATERAL queries
  • which, in turn, will run a highly performant COUNT on the rides table each, utilizing its spatial indexes

Note that the key here is to pass each traversed polygon.geom into the indexes on rides - meaning that you need to transform polygon.geom to match the SRID of both rides geometries. Not being able to utilize the spatial indexes due to the transformation is a major issue in your query. This way there is now no need to run a pre-filter by envelope.


This is effectively equivalent to using an actual correlated sub-query per value set (start & end) per row:

SELECT  ply.uid_key AS "PolygonUID",
        _y AS "Year",
        (
          SELECT COUNT(*) AS _cnt
          FROM   rides AS t
          WHERE  ST_Intersects(ST_Transform(ply.geom, 4326), t.start_geom)
            AND  t.start_date_year = y
        ) AS "N_Starts",
        (
          SELECT COUNT(*) AS _cnt
          FROM   rides AS t
          WHERE  ST_Intersects(ST_Transform(ply.geom, 4326), t.end_geom)
            AND  t.end_date_year = y
        ) AS "N_Ends"
FROM    regions AS ply
CROSS JOIN
        GENERATE_SERIES(2015, 2021) AS _y
;

but these two concepts may use fundamentally different query plans, with different benefits based on use case. Here, this latter query should be slightly slower.