PostGIS – Mapping Points to Nearest Area Using PostgreSQL

knnpostgispostgresql

I have a bunch of UK postcodes, which are basically lat/lng points. I currently map these to user-defined "areas" as follows:

  1. Create a very small box around the point.
  2. Search for areas which intersect that.
  3. If I find any, choose the one with the smallest area.
  4. If I don't, step up the box and try again.

Some notes:

  • Point 3 is crucial. One area might be "England"; another might be "Camden". I want to map to Camden, not England.
  • The areas have various geometries, though I could make them all polygons if need be.
  • The areas may overlap. There are gaps between areas.
  • A point may therefore be within one or more areas, in which case I want the smallest, or not within any, in which case I want the closest.
  • I have min and max bounds I want to apply to the size of the areas, but if need be I could do that outside the scope of this question by pre-filtering the data.

What I have currently, using MySQL and spatial indexes, is like this (edited for brevity):

    $swlat = round($loc['lat'], 2) - self::LOCATION_STEP;
    $swlng = round($loc['lng'], 2) - self::LOCATION_STEP;
    $nelat = round($loc['lat'], 2) + self::LOCATION_STEP;
    $nelng = round($loc['lng'], 2) + self::LOCATION_STEP;

    $count = 0;

    do {
        $poly = "POLYGON(($swlng $swlat, $swlng $nelat, $nelng $nelat, $nelng $swlat, $swlng $swlat))";

        # We want to find the smallest nearby or containing area.
        #
        # Exclude locations which are very large, e.g. Greater London or too small (probably just a
        # single building.
        $sql = "SELECT locations.id, locations.name, locations.maxdimension, ST_AsText(locations_spatial.geometry) AS geom,
CASE WHEN ST_Intersects(locations_spatial.geometry, ST_GeomFromText('POINT({$loc['lng']} {$loc['lat']})', {$this->dbhr->SRID()})) THEN 0
ELSE ST_Distance(locations_spatial.geometry, ST_GeomFromText('POINT({$loc['lng']} {$loc['lat']})', {$this->dbhr->SRID()})) END AS dist
FROM locations_spatial INNER JOIN `locations` ON locations.id = locations_spatial.locationid 
LEFT OUTER JOIN locations_excluded ON locations_excluded.locationid = locations.id 
WHERE 
  ST_Intersects(locations_spatial.geometry, ST_GeomFromText('$poly', {$this->dbhr->SRID()})) AND type != 'Postcode' 
  AND ST_Dimension(locations_spatial.geometry) = 2 AND locations_excluded.locationid IS NULL 
  AND locations_spatial.locationid != $id AND maxdimension < " . self::TOO_LARGE . " AND maxdimension > " . self::TOO_SMALL . " 
ORDER BY maxdimension ASC, dist ASC LIMIT 1;";
        $nearbyes = $this->dbhr->preQuery($sql);

        if (count($nearbyes) === 0) {
            $swlat -= self::LOCATION_STEP;
            $swlng -= self::LOCATION_STEP;
            $nelat += self::LOCATION_STEP;
            $nelng += self::LOCATION_STEP;
            $count++;
        }

    } while (count($nearbyes) == 0 && $count < self::LOCATION_MAX);

…but it is slow and I would like to speed it up.

What is an efficient way of doing this nowadays? More specifically, is there a way to avoid this loop approach (either using PostGIS or other tech)?

I'm familiar with the KNN problem, but I don't think this is exactly the same. I have dabbled with PostgreSQL/PostGIS, but I'm not familiar in depth with what is/isn't efficient there.

Best Answer

Given your note

A point may therefore be within one or more areas, in which case I want the smallest, or not within any, in which case I want the closest.

things are simple: rank the ordered distance to nearby (or containing) areas, and find the smallest one in that set:

SELECT id,
       geom
FROM   (
  SELECT id,
         geom,
         RANK() OVER(ORDER BY geom <-> '<POI_WKT>'::GEOMETRY) AS _rnk
  FROM   <areas>
  -- WHERE ST_Area(geom) BETWEEN <min> AND <max>
  LIMIT  <areas_row_count/3>
) q
WHERE  _rnk = 1
ORDER BY
       ST_Area(geom)
LIMIT  1
;

Here

  • all <areas>.geom having an equal distance to the given poi will get the same RANK (this includes 0.0 for those that contain the poi), thus
    • if a poi is within one or multiple <areas>.geom, or if more than one <areas>.geom are equally close to the poi, the query returns the one with the smallest ST_Area, otherwise it returns the single closest <areas>.geom to the given poi (no matter the ST_Area)
  • because of the ORDER BY in the RANK Window function and the inner LIMIT, the operation is a fully index driven (K)NN search; the actual LIMIT value is not important as long as it is less than one third of the <areas> table row count

If that rule set is not sufficient and e.g. in the case where a poi does not lie within an <areas>.geom and you want to consider the ST_Area of the k nearest neighbors, too, to to decide which one to return, you'd need to RANK over a factor between ST_Distance & ST_Area (as mentioned by @dr_jts) - which, because of the dynamic ST_Distance partner geometry, cannot get indexed.

Update:

You could use an augmented RANK over ST_Area and ORDER BY the sum of both ranked values - using e.g. its SQRT will have a strong bias towards smaller areas; other rank amplifiers are possible, too:

SELECT  id,
        geom
FROM    (
    SELECT  id,
            area_rank,
            geom,
            SQRT(RANK() OVER(ORDER BY ST_Area(geom))) AS _arnk,
            RANK() OVER(ORDER BY geom <-> '<POI_WKT>'::GEOMETRY) AS _drnk
    FROM    <areas>
    -- WHERE ST_Area(geom) BETWEEN <min> AND <max>
    LIMIT   <areas_row_count/3>
) q
ORDER BY
        _arnk + _drnk
LIMIT   1
;