[GIS] How to optimize this ST_Distance query

postgispostgresqlst-distancest-dwithin

I am using ST_Distance in postgis to get the closest point in a table to each polygon in another table. This query is taking a long time to run (2 hours with 18,000 records in the polygon table and 7,500 records in the point table). I am calculating the distance in miles and also limiting the search to points that are within 100 miles (160934 meters) of each polygon. What else could I do to speed this up? The geometries are all in SRID 4326, and I have created spatial indexes on both tables. I'm running postgresql 9.5 with postgis 2.2.2. Here is the query:

SELECT DISTINCT ON(g1.gid) g1.*, g2.plant_name As closest_utility_name,
g2.primsource As closest_utility_type,
round(ST_Distance(g1.geom::geography,g2.geom::geography) * 0.000621371) As closest_utility_miles
INTO parcels_utility
FROM parcels As g1, powerplants_us_201603 As g2
WHERE ST_DWithin(g1.geom, g2.geom, 160934)
ORDER BY g1.gid, closest_utility_miles;

Explain provides this query plan:

"Unique  (cost=11006.38..11006.60 rows=45 width=713)"
"  ->  Sort  (cost=11006.38..11006.49 rows=45 width=713)"
"        Sort Key: g1.gid, (round((_st_distance((g1.geom)::geography, (g2.geom)::geography, '0'::double precision, true) * '0.000621371'::double precision)))"
"        ->  Nested Loop  (cost=0.15..11005.14 rows=45 width=713)"
"              ->  Seq Scan on parcels g1  (cost=0.00..1692.01 rows=18001 width=653)"
"              ->  Index Scan using powerplants_us_201603_gix on powerplants_us_201603 g2  (cost=0.15..0.51 rows=1 width=60)"
"                    Index Cond: (geom && st_expand(g1.geom, '160934'::double precision))"
"                    Filter: ((g1.geom && st_expand(geom, '160934'::double precision)) AND _st_dwithin(g1.geom, geom, '160934'::double precision))"

Based on the comment below I've updated the ST_DWithin function to use the geography type:

SELECT DISTINCT ON(g1.gid) g1.*, g2.plant_name As closest_utility_name,
g2.primsource As closest_utility_type,
round(ST_Distance(g1.geom::geography,g2.geom::geography) * 0.000621371) As closest_utility_miles
INTO parcels_utility
FROM parcels As g1, powerplants_us_201603 As g2
WHERE ST_DWithin(g1.geom::geography, g2.geom::geography, 160934)
ORDER BY g1.gid, closest_utility_miles;

It has been running now for about an hour, the query plan is below:

"Unique  (cost=612038648.76..612038650.68 rows=200 width=455)"
"  ->  Sort  (cost=612038648.76..612038649.72 rows=383 width=455)"
"        Sort Key: g1.gid, (round((_st_distance((g1.geom)::geography, (g2.geom)::geography, '0'::double precision, true) * '0.000621371'::double precision)))"
"        ->  Nested Loop  (cost=0.00..612038632.33 rows=383 width=455)"
"              Join Filter: (((g1.geom)::geography && _st_expand((g2.geom)::geography, '160934'::double precision)) AND ((g2.geom)::geography && _st_expand((g1.geom)::geography, '160934'::double precision)) AND _st_dwithin((g1.geom)::geography, (g2.geom)::g (...)"
"              ->  Seq Scan on parcels g1  (cost=0.00..20630.55 rows=152755 width=395)"
"              ->  Materialize  (cost=0.00..430.86 rows=7524 width=60)"
"                    ->  Seq Scan on powerplants_us_201603 g2  (cost=0.00..393.24 rows=7524 width=60)"

Best Answer

You're mixing units in your original attempt, and improperly using geography in your second. I'll provide a geography solution, though if your data are local the very highest performance will be using planar (not 4326) geometry.

-- Change the data type in the tables
ALTER TABLE powerplants_us_201603 ALTER COLUMN geom TYPE geography USING geography(geom);
ALTER TABLE parcels ALTER COLUMN geom TYPE geography USING geography(geom);
ALTER TABLE powerplants_us_201603 RENAME COLUMN geom TO geog;
ALTER TABLE parcels RENAME COLUMN geom TO geog;

-- Build real geography indexes
CREATE INDEX parcels_geog_x ON parcels USING GIST (geog);
CREATE INDEX powerplants_geog_x ON powerplants_us_201603 USING GIST (geog);

-- Run the (simpler now) query
SELECT DISTINCT ON (g1.gid)
  g1.*, 
  g2.plant_name AS closest_utility_name,
  g2.primsource AS closest_utility_type,
  round(ST_Distance(g1.geog, g2.geog) / 1609.34) AS closest_utility_miles
FROM parcels AS g1, 
JOIN powerplants_us_201603 AS g2
ON ST_DWithin(g1.geog, g2.geog, 100 * 1609.34)
ORDER BY g1.gid, closest_utility_miles;
Related Question