Following scenario: I have a table roads
, which basically contains the road network of a whole country extracted from the OSM line
table and a table points
containing millions of GPS tracked points which are part of tracks. Both have a id
column, a PostGIS geometry geom
column with types LineString
(roads) and Point
(points) as well a spatial index on those columns. The reference system has SRID 4326. The geometry index is clustered.
Versions: Postgres 9.5, PostGIS 2.2.2
I want to find out which road is the closest for each point. I came up with the following as proposed here: How to find the nearest point by using PostGIS function?
SELECT point.id, road.id, ST_Distance(road.geom, point.geom) AS Distance
FROM roads AS road, points AS point
WHERE road.id = (
SELECT r.id
FROM roads as r, points as p
WHERE point.id = p.id
ORDER BY ST_Distance(r.geom, p.geom) ASC LIMIT 1
);
The result of EXPLAIN ANALYZE
where the result is limited to 1 point:
"Limit (cost=2171519.51..4343038.72 rows=1 width=186) (actual time=7944.471..7944.471 rows=1 loops=1)"
" -> Nested Loop (cost=2171519.51..4578064121748.85 rows=2108230 width=186) (actual time=7944.470..7944.470 rows=1 loops=1)"
" -> Seq Scan on points point (cost=0.00..78098.30 rows=2108230 width=36) (actual time=0.006..0.006 rows=1 loops=1)"
" -> Index Scan using roads_id_index on roads road (cost=2171519.51..2171519.95 rows=1 width=150) (actual time=0.010..0.010 rows=1 loops=1)"
" Index Cond: (id = (SubPlan 1))"
" SubPlan 1"
" -> Limit (cost=2171519.07..2171519.07 rows=1 width=182) (actual time=7944.438..7944.438 rows=1 loops=1)"
" -> Sort (cost=2171519.07..2189421.96 rows=7161155 width=182) (actual time=7944.437..7944.437 rows=1 loops=1)"
" Sort Key: (st_distance(r.geom, p.geom))"
" Sort Method: top-N heapsort Memory: 25kB"
" -> Nested Loop (cost=0.43..2135713.30 rows=7161155 width=182) (actual time=0.034..7046.628 rows=7161026 loops=1)"
" -> Index Scan using points_pkey on points p (cost=0.43..8.45 rows=1 width=32) (actual time=0.013..0.015 rows=1 loops=1)"
" Index Cond: (point.id = id)"
" -> Seq Scan on roads r (cost=0.00..273804.55 rows=7161155 width=150) (actual time=0.003..1455.615 rows=7161026 loops=1)"
" SubPlan 1"
" -> Limit (cost=2171519.07..2171519.07 rows=1 width=182) (actual time=7944.438..7944.438 rows=1 loops=1)"
" -> Sort (cost=2171519.07..2189421.96 rows=7161155 width=182) (actual time=7944.437..7944.437 rows=1 loops=1)"
" Sort Key: (st_distance(r.geom, p.geom))"
" Sort Method: top-N heapsort Memory: 25kB"
" -> Nested Loop (cost=0.43..2135713.30 rows=7161155 width=182) (actual time=0.034..7046.628 rows=7161026 loops=1)"
" -> Index Scan using points_pkey on points p (cost=0.43..8.45 rows=1 width=32) (actual time=0.013..0.015 rows=1 loops=1)"
" Index Cond: (point.id = id)"
" -> Seq Scan on de_road_network r (cost=0.00..273804.55 rows=7161155 width=150) (actual time=0.003..1455.615 rows=7161026 loops=1)"
"Planning time: 0.426 ms"
"Execution time: 7944.560 ms"
This query works fine, but i have an efficiency problem.
The query above takes more than 7 seconds for one point. I think that there is no way to speed this up with the index, since the ST_Distance
function cannot use the index. Anyway, since i would like to do this for (~ 3) millions points i would like to speed up the query or to develop a faster one.
Do you have any suggestions on that ?
Some thoughts:
- Using the
&&
oeprator, which uses the index. Changing theWHERE
clause of the sub-query above to:WHERE point.id = p.id AND p.geom && r.geom
results in a massive increase in speed. See: Nearest Neighbor problem in Postgis 2.0 using GIST Index (<-> function) .
Is this solution accurate ? What happens if a point p
has a closest road r1
but p
is not in the bounding box of r1
but in the bounding box of r2
? Is this scenario feasible ?
- Using the
<->
operator: I tried to replace theST_Distance
with the<->
operator but without success. The spatial index is not involved. I guess it is because
Index only kicks in if one of the geometries is a constant (not in a subquery/cte). e.g. 'SRID=3005;POINT(1011102 450541)'::geometry instead of a.geom
-
Including a pre-filtering step using other postgis functions ? E.g. with the
ST_DWithin
function. (See: Find nearest neighbours faster in PostGIS). How would a query look like if i want to specify the distance in meters ? -
Would it probably make sense to turn this problem into a point in polygon problem by transforming each road into a polygon with a fixed width followed by a point in polygon test ? I guess a lot of points would only match one polygon. And for those which do not, i could make the distance test.
-
Get an approximative solution (e.g. Using a radial sweep-line as proposed here: Find Nearest Line Segments to Point)
Solution – Final Query: Using CROSS JOIN LATERAL
(Credits to Tyler)
SELECT p.id AS point_id, b.id AS road_id, ST_DISTANCE(b.geom, p.geom) AS distance
FROM points p
CROSS JOIN LATERAL (
SELECT r.id AS id, r.geom AS geom
FROM roads r
ORDER BY r.geom <-> p.geom
LIMIT 1
) b;
Best Answer
Would you be able to add the result of putting "EXPLAIN ANALYZE" before the query in your question? Then I can update my answer with suggestions.Of course, you will need GIST indexes on your table's geometry fields and vacuum analyze first.
You could try something like this, at the very least it might give you some ideas which may result in you finding your answer.
Explanation:
I personally enjoy using CROSS JOIN LATERAL because I can pass in the parent table's column to filter and limit the results. The <-> operator is used to increase performance for doing nearest neighbor approximate distance ordering - the performance gains far outweigh the "approximate" distance in terms of accuracy.
Resources:
http://postgis.net/docs/geometry_distance_knn.html http://postgis.net/docs/geometry_distance_centroid.html https://www.postgresql.org/docs/9.3/static/queries-table-expressions.html#QUERIES-LATERAL