[GIS] PostGIS / OSM: Faster query to find nearest line of points

nearest neighboropenstreetmapoptimizationpostgis-2.2query

Following scenario: I have a table roads, which basically contains the road network of a whole country extracted from the OSM line table and a table points containing millions of GPS tracked points which are part of tracks. Both have a id column, a PostGIS geometry geom column with types LineString (roads) and Point (points) as well a spatial index on those columns. The reference system has SRID 4326. The geometry index is clustered.

Versions: Postgres 9.5, PostGIS 2.2.2

I want to find out which road is the closest for each point. I came up with the following as proposed here: How to find the nearest point by using PostGIS function?

SELECT point.id, road.id, ST_Distance(road.geom, point.geom) AS Distance 
FROM roads AS road, points AS point
WHERE road.id = ( 
   SELECT r.id
   FROM roads as r, points as p
   WHERE point.id = p.id
   ORDER BY ST_Distance(r.geom, p.geom) ASC LIMIT 1
);

The result of EXPLAIN ANALYZE where the result is limited to 1 point:

"Limit  (cost=2171519.51..4343038.72 rows=1 width=186) (actual time=7944.471..7944.471 rows=1 loops=1)"
"  ->  Nested Loop  (cost=2171519.51..4578064121748.85 rows=2108230 width=186) (actual time=7944.470..7944.470 rows=1 loops=1)"
"        ->  Seq Scan on points point  (cost=0.00..78098.30 rows=2108230 width=36) (actual time=0.006..0.006 rows=1 loops=1)"
"        ->  Index Scan using roads_id_index on roads road  (cost=2171519.51..2171519.95 rows=1 width=150) (actual time=0.010..0.010 rows=1 loops=1)"
"              Index Cond: (id = (SubPlan 1))"
"              SubPlan 1"
"                ->  Limit  (cost=2171519.07..2171519.07 rows=1 width=182) (actual time=7944.438..7944.438 rows=1 loops=1)"
"                      ->  Sort  (cost=2171519.07..2189421.96 rows=7161155 width=182) (actual time=7944.437..7944.437 rows=1 loops=1)"
"                            Sort Key: (st_distance(r.geom, p.geom))"
"                            Sort Method: top-N heapsort  Memory: 25kB"
"                            ->  Nested Loop  (cost=0.43..2135713.30 rows=7161155 width=182) (actual time=0.034..7046.628 rows=7161026 loops=1)"
"                                  ->  Index Scan using points_pkey on points p  (cost=0.43..8.45 rows=1 width=32) (actual time=0.013..0.015 rows=1 loops=1)"
"                                        Index Cond: (point.id = id)"
"                                  ->  Seq Scan on roads r  (cost=0.00..273804.55 rows=7161155 width=150) (actual time=0.003..1455.615 rows=7161026 loops=1)"
"              SubPlan 1"
"                ->  Limit  (cost=2171519.07..2171519.07 rows=1 width=182) (actual time=7944.438..7944.438 rows=1 loops=1)"
"                      ->  Sort  (cost=2171519.07..2189421.96 rows=7161155 width=182) (actual time=7944.437..7944.437 rows=1 loops=1)"
"                            Sort Key: (st_distance(r.geom, p.geom))"
"                            Sort Method: top-N heapsort  Memory: 25kB"
"                            ->  Nested Loop  (cost=0.43..2135713.30 rows=7161155 width=182) (actual time=0.034..7046.628 rows=7161026 loops=1)"
"                                  ->  Index Scan using points_pkey on points p  (cost=0.43..8.45 rows=1 width=32) (actual time=0.013..0.015 rows=1 loops=1)"
"                                        Index Cond: (point.id = id)"
"                                  ->  Seq Scan on de_road_network r  (cost=0.00..273804.55 rows=7161155 width=150) (actual time=0.003..1455.615 rows=7161026 loops=1)"
"Planning time: 0.426 ms"
"Execution time: 7944.560 ms"

This query works fine, but i have an efficiency problem.

The query above takes more than 7 seconds for one point. I think that there is no way to speed this up with the index, since the ST_Distance function cannot use the index. Anyway, since i would like to do this for (~ 3) millions points i would like to speed up the query or to develop a faster one.

Do you have any suggestions on that ?

Some thoughts:

Is this solution accurate ? What happens if a point p has a closest road r1 but p is not in the bounding box of r1 but in the bounding box of r2 ? Is this scenario feasible ?

  • Using the <-> operator: I tried to replace the ST_Distance with the <-> operator but without success. The spatial index is not involved. I guess it is because

Index only kicks in if one of the geometries is a constant (not in a subquery/cte). e.g. 'SRID=3005;POINT(1011102 450541)'::geometry instead of a.geom

  • Including a pre-filtering step using other postgis functions ? E.g. with the ST_DWithin function. (See: Find nearest neighbours faster in PostGIS). How would a query look like if i want to specify the distance in meters ?

  • Would it probably make sense to turn this problem into a point in polygon problem by transforming each road into a polygon with a fixed width followed by a point in polygon test ? I guess a lot of points would only match one polygon. And for those which do not, i could make the distance test.

  • Get an approximative solution (e.g. Using a radial sweep-line as proposed here: Find Nearest Line Segments to Point)

Solution – Final Query: Using CROSS JOIN LATERAL (Credits to Tyler)

SELECT p.id AS point_id, b.id AS road_id, ST_DISTANCE(b.geom, p.geom) AS distance
FROM points p
CROSS JOIN LATERAL (
    SELECT r.id AS id, r.geom AS geom
    FROM roads r
    ORDER BY r.geom <-> p.geom 
    LIMIT 1
) b;

Best Answer

Would you be able to add the result of putting "EXPLAIN ANALYZE" before the query in your question? Then I can update my answer with suggestions.

Of course, you will need GIST indexes on your table's geometry fields and vacuum analyze first.

CREATE INDEX ON road USING gist (geom);
CREATE INDEX ON point USING gist (geom);
VACUUM ANALYZE road;
VACUUM ANALYZE point;

You could try something like this, at the very least it might give you some ideas which may result in you finding your answer.

SELECT p.id, cjl.id
FROM point p
CROSS JOIN LATERAL (
    SELECT r.id 
    FROM road r
    ORDER BY r.geom <-> p.geom 
    LIMIT 1
) cjl
WHERE p.id = 1

Explanation:

I personally enjoy using CROSS JOIN LATERAL because I can pass in the parent table's column to filter and limit the results. The <-> operator is used to increase performance for doing nearest neighbor approximate distance ordering - the performance gains far outweigh the "approximate" distance in terms of accuracy.

Resources:

http://postgis.net/docs/geometry_distance_knn.html http://postgis.net/docs/geometry_distance_centroid.html https://www.postgresql.org/docs/9.3/static/queries-table-expressions.html#QUERIES-LATERAL