I have two tables in a PostGIS database:
rides
, containing about 3 million datasets:
ride_uid
(unique integer)start_geom
(point geometry with index in 4326)end_geom
(point geometry with index in 4326)start_date_year
(integer)end_date_year
(integer)- a lot more stuff
This rides
table also has some invalid geometries in it, namely NULL
values in start_geom
or end_geom
or non-valid geometries laying somewhere near north pole or sahara dessert. So I'd check for that using where ST_Within(st.start_geom,ST_MakeEnvelope(5, 35, 20, 55, 4326))
to prevent a reprojection error.
polygons
, containing about 30 datasets:
-
uid_key
(string) -
name
(string) -
geometry
(multipolygons with index in 25832) -
a lot more stuff
I want to figure out the start's and end's for all polygon's and a given range of years. I also want to consider 0 rides in a year/polygon. E.g. as result:
PolygonUID | Year | N_Starts | N_Ends |
---|---|---|---|
A | 2015 | 55 | 33 |
A | 2016 | 0 | 22 |
A | 2017 | 1001 | 0 |
B | 2015 | 63 | 12 |
B | 2016 | 10 | 666 |
B | 2017 | 0 | 0 |
I am struggling to find an efficient query to achieve my goal in one go. This is what I got so far; It returns my desired result for only starts or only ends in about ~4-5 minutes (which is acceptable).
select
hlpr.uid_key,
hlpr.name,
hlpr.y,
coalesce(count(s.ride_uid),0) as cnt_start --,
--coalesce(count(e.ride_uid),0) as cnt_end -- takes forever with this at the same time
from
(
select
poly.uid_key,
poly.name,
cal.y,
poly.geom
from schema_b.polygons as poly
cross join
(
select cast(generate_series(2015,2021) as integer) as y
) as cal
) as hlpr
left join
(
select
st.ride_uid,
st.start_date_year,
st.start_geom,
pol.uid_key
from schema_a.rides as st
left join
(
select
uid_key,
geom
from schema_b.polygons
) as pol
on ST_Intersects(ST_Transform(st.start_geom,25832),pol.geom)
where ST_Within(st.start_geom,ST_MakeEnvelope(5, 35, 20, 55, 4326))
) as s
on s.start_date_year = hlpr.y and s.uid_key = hlpr.uid_key
-- takes forever with this:
/*
left join
(
select
en.ride_uid,
en.end_date_year,
en.end_geom,
pol.uid_key
from schema_a.rides as en
left join
(
select
uid_key,
geom
from schema_b.polygons
) as pol
on ST_Intersects(ST_Transform(en.end_geom,25832),pol.geom)
where ST_Within(en.end_geom,ST_MakeEnvelope(5, 35, 20, 55, 4326))
) as e
on e.end_date_year = hlpr.y and e.uid_key = hlpr.uid_key
*/
group by
hlpr.uid_key,
hlpr.name,
hlpr.y
order by
hlpr.uid_key,
hlpr.y
Can I optimize this query somehow to get my desired result faster in one go? Or is this an unusual approach that should be avoided at all and instead only one count be done per query and the results merged manually afterwards?
Best Answer
Rather than a series of set-wise
JOIN
s (that necessarily creates extremely inefficient cartesian products), use a correlated expression during a traversal of yourpolygons
table - that way you can fully benefit from indexes on bothrides.start_geom
&rides.end_geom
.If possible, I favor a
LATERAL
expression:Here in full verbose mode - all
CROSS JOIN
s can literally be replaced by a,
.For each row in
polygons
GENERATE_SERIES
), virtually increasing row count in the setpolygon.geom
and the cross joined year_y
, and pass those to bothLATERAL
queriesCOUNT
on therides
table each, utilizing its spatial indexesNote that the key here is to pass each traversed
polygon.geom
into the indexes onrides
- meaning that you need to transformpolygon.geom
to match the SRID of bothrides
geometries. Not being able to utilize the spatial indexes due to the transformation is a major issue in your query. This way there is now no need to run a pre-filter by envelope.This is effectively equivalent to using an actual correlated sub-query per value set (start & end) per row:
but these two concepts may use fundamentally different query plans, with different benefits based on use case. Here, this latter query should be slightly slower.