PostGIS Nearest Points – Finding Nearest Points with ST_Distance and kNN

distancenearest neighborpostgis

I need to obtain on each element on one table the closest point of another table. The first table contains traffic signs and the second one the Entrance Halls of the town.
The thing is that I can't use ST_ClosestPoint function and I have to use ST_Distance function and get the min(ST_distance) record but I am quite stuck building the query.

CREATE TABLE traffic_signs
(
  id numeric(8,0) ),
  "GEOMETRY" geometry,
  CONSTRAINT traffic_signs_pkey PRIMARY KEY (id),
  CONSTRAINT traffic_signs_id_key UNIQUE (id)
)
WITH (
  OIDS=TRUE
);

CREATE TABLE entrance_halls
(
  id numeric(8,0) ),
  "GEOMETRY" geometry,
  CONSTRAINT entrance_halls_pkey PRIMARY KEY (id),
  CONSTRAINT entrance_halls_id_key UNIQUE (id)
)
WITH (
  OIDS=TRUE
);

I need to obtain the id of the closest entrnce_hall of every traffic_sign.

My query so far:

SELECT senal.id,port.id,ST_Distance(port."GEOMETRY",senal."GEOMETRY")  as dist
    FROM traffic_signs As senal, entrance_halls As port   
    ORDER BY senal.id,port.id,ST_Distance(port."GEOMETRY",senal."GEOMETRY")

With this I am getting the distance from every traffic_sign to every entrance_hall but how can I get only the minimun distance?

Best Answer

You are nearly there. There is a little trick which is to use PostgreSQL's distinct operator, which will return the first match of each combination -- as you are ordering by ST_Distance, effectively it will return the closest point from each senal to each port.

SELECT 
   DISTINCT ON (senal.id) senal.id, port.id, ST_Distance(port."GEOMETRY", senal."GEOMETRY")  as dist
FROM traffic_signs As senal, entrance_halls As port   
ORDER BY senal.id, port.id, ST_Distance(port."GEOMETRY", senal."GEOMETRY");

If you know that the minimum distance in each case is no more than some amount x, (and you have a spatial index on your tables), you can speed this up by putting a WHERE ST_DWithin(port."GEOMETRY", senal."GEOMETRY", distance), eg, if all the minumum distances are known to be no more than 10km, then:

SELECT 
   DISTINCT ON (senal.id) senal.id, port.id, ST_Distance(port."GEOMETRY", senal."GEOMETRY")  as dist
FROM traffic_signs As senal, entrance_halls As port  
WHERE ST_DWithin(port."GEOMETRY", senal."GEOMETRY", 10000) 
ORDER BY senal.id, port.id, ST_Distance(port."GEOMETRY", senal."GEOMETRY");

Obviously, this needs to be used with caution, as if the minimum distance is greater, you will simply get no row for that combination of senal and port.

The order by order must match the distinct on order, which makes sense, as distinct is taking the first distinct group based on some ordering.

It is assumed that you have a spatial index on both tables.

There is another option, which is to use PostgreSQL's <-> and <#> operators, (center point and bounding box distance calculations, respectively) which make more efficient use of the spatial index and don't require the ST_DWithin hack to avoid n^2 comparisons. There is a good blog article explaining how they work. The general thing to note is that these two operators work in the ORDER BY clause.

SELECT senal.id, 
  (SELECT port.id 
   FROM entrance_halls as port 
   ORDER BY senal.geom <#> port.geom LIMIT 1)
FROM  traffic_signs as senal;

As this question has received a lot of attention and k-nearest neighbours (kNN) is generally a hard problem (in terms of algorithmic run-time) in GIS, it seems worthwhile to expand somewhat on the original scope of this question.

The standard way for find the x nearest neighbours of one object is to use a LATERAL JOIN (conceptually similar to a for each loop). Borrowing shamelessly from @dbaston's answer, you would do something like:

SELECT
  signs.id,
  closest_port.id,
  closest_port.dist
 FROM traffic_signs
CROSS JOIN LATERAL 
  (SELECT
      id, 
      ST_Distance(ports.geom, signs.geom) as dist
      FROM ports
      ORDER BY signs.geom <-> ports.geom
     LIMIT 1
   ) AS closest_port

So, if you want to find the nearest 10 ports, ordered by distance, you simply have to change the LIMIT clause in the lateral sub-query. This is much harder to do without LATERAL JOINS and involves using ARRAY type logic. While this approach works well, it can be sped up enormously if you know you only have to search out to a given distance. In this instance, you can use ST_DWithin(signs.geom, ports.geom, 1000) in the subquery, which because of the way indexing works with the <-> operator -- one of the geometries should be a constant, rather than a column reference -- may be much faster. So, for example, to get the 3 nearest ports, within 10km, you could write something like the following.

 SELECT
  signs.id,
  closest_port.id,
  closest_port.dist
 FROM traffic_signs
CROSS JOIN LATERAL 
  (SELECT
      id, 
      ST_Distance(ports.geom, signs.geom) as dist
      FROM ports
      WHERE ST_DWithin(ports.geom, signs.geom, 10000)
      ORDER BY ST_Distance(ports.geom, signs.geom)
     LIMIT 3
   ) AS closest_port;

As always, usage will vary depending on your data distribution and queries, so EXPLAIN is your best friend.

Finally, there is a minor gotcha, if using LEFT instead of CROSS JOIN LATERAL in that you have to add ON TRUE after the lateral queries alias, eg,

SELECT
  signs.id,
  closest_port.id,
  closest_port.dist
 FROM traffic_signs
LEFT JOIN LATERAL 
  (SELECT
      id, 
      ST_Distance(ports.geom, signs.geom) as dist
      FROM ports          
      ORDER BY signs.geom <-> ports.geom
      LIMIT 1
   ) AS closest_port
   ON TRUE;