I have a bunch of UK postcodes, which are basically lat/lng points. I currently map these to user-defined "areas" as follows:
- Create a very small box around the point.
- Search for areas which intersect that.
- If I find any, choose the one with the smallest area.
- If I don't, step up the box and try again.
Some notes:
- Point 3 is crucial. One area might be "England"; another might be "Camden". I want to map to Camden, not England.
- The areas have various geometries, though I could make them all polygons if need be.
- The areas may overlap. There are gaps between areas.
- A point may therefore be within one or more areas, in which case I want the smallest, or not within any, in which case I want the closest.
- I have min and max bounds I want to apply to the size of the areas, but if need be I could do that outside the scope of this question by pre-filtering the data.
What I have currently, using MySQL and spatial indexes, is like this (edited for brevity):
$swlat = round($loc['lat'], 2) - self::LOCATION_STEP;
$swlng = round($loc['lng'], 2) - self::LOCATION_STEP;
$nelat = round($loc['lat'], 2) + self::LOCATION_STEP;
$nelng = round($loc['lng'], 2) + self::LOCATION_STEP;
$count = 0;
do {
$poly = "POLYGON(($swlng $swlat, $swlng $nelat, $nelng $nelat, $nelng $swlat, $swlng $swlat))";
# We want to find the smallest nearby or containing area.
#
# Exclude locations which are very large, e.g. Greater London or too small (probably just a
# single building.
$sql = "SELECT locations.id, locations.name, locations.maxdimension, ST_AsText(locations_spatial.geometry) AS geom,
CASE WHEN ST_Intersects(locations_spatial.geometry, ST_GeomFromText('POINT({$loc['lng']} {$loc['lat']})', {$this->dbhr->SRID()})) THEN 0
ELSE ST_Distance(locations_spatial.geometry, ST_GeomFromText('POINT({$loc['lng']} {$loc['lat']})', {$this->dbhr->SRID()})) END AS dist
FROM locations_spatial INNER JOIN `locations` ON locations.id = locations_spatial.locationid
LEFT OUTER JOIN locations_excluded ON locations_excluded.locationid = locations.id
WHERE
ST_Intersects(locations_spatial.geometry, ST_GeomFromText('$poly', {$this->dbhr->SRID()})) AND type != 'Postcode'
AND ST_Dimension(locations_spatial.geometry) = 2 AND locations_excluded.locationid IS NULL
AND locations_spatial.locationid != $id AND maxdimension < " . self::TOO_LARGE . " AND maxdimension > " . self::TOO_SMALL . "
ORDER BY maxdimension ASC, dist ASC LIMIT 1;";
$nearbyes = $this->dbhr->preQuery($sql);
if (count($nearbyes) === 0) {
$swlat -= self::LOCATION_STEP;
$swlng -= self::LOCATION_STEP;
$nelat += self::LOCATION_STEP;
$nelng += self::LOCATION_STEP;
$count++;
}
} while (count($nearbyes) == 0 && $count < self::LOCATION_MAX);
…but it is slow and I would like to speed it up.
What is an efficient way of doing this nowadays? More specifically, is there a way to avoid this loop approach (either using PostGIS or other tech)?
I'm familiar with the KNN problem, but I don't think this is exactly the same. I have dabbled with PostgreSQL/PostGIS, but I'm not familiar in depth with what is/isn't efficient there.
Best Answer
Given your note
things are simple: rank the ordered distance to nearby (or containing) areas, and find the smallest one in that set:
Here
<areas>.geom
having an equal distance to the givenpoi
will get the sameRANK
(this includes0.0
for those that contain thepoi
), thuspoi
is within one or multiple<areas>.geom
, or if more than one<areas>.geom
are equally close to thepoi
, the query returns the one with the smallestST_Area
, otherwise it returns the single closest<areas>.geom
to the givenpoi
(no matter theST_Area
)ORDER BY
in theRANK
Window function and the innerLIMIT
, the operation is a fully index driven (K)NN search; the actualLIMIT
value is not important as long as it is less than one third of the<areas>
table row countIf that rule set is not sufficient and e.g. in the case where a
poi
does not lie within an<areas>.geom
and you want to consider theST_Area
of the k nearest neighbors, too, to to decide which one to return, you'd need toRANK
over a factor betweenST_Distance
&ST_Area
(as mentioned by @dr_jts) - which, because of the dynamicST_Distance
partner geometry, cannot get indexed.Update:
You could use an augmented
RANK
overST_Area
andORDER BY
the sum of both ranked values - using e.g. itsSQRT
will have a strong bias towards smaller areas; other rank amplifiers are possible, too: