[GIS] How to optimize this ST_Distance query

postgispostgresqlst-distancest-dwithin

I am using ST_Distance in postgis to get the closest point in a table to each polygon in another table. This query is taking a long time to run (2 hours with 18,000 records in the polygon table and 7,500 records in the point table). I am calculating the distance in miles and also limiting the search to points that are within 100 miles (160934 meters) of each polygon. What else could I do to speed this up? The geometries are all in SRID 4326, and I have created spatial indexes on both tables. I'm running postgresql 9.5 with postgis 2.2.2. Here is the query:

SELECT DISTINCT ON(g1.gid) g1.*, g2.plant_name As closest_utility_name,
g2.primsource As closest_utility_type,
round(ST_Distance(g1.geom::geography,g2.geom::geography) * 0.000621371) As closest_utility_miles
INTO parcels_utility
FROM parcels As g1, powerplants_us_201603 As g2
WHERE ST_DWithin(g1.geom, g2.geom, 160934)
ORDER BY g1.gid, closest_utility_miles;

Explain provides this query plan:

"Unique  (cost=11006.38..11006.60 rows=45 width=713)"
"  ->  Sort  (cost=11006.38..11006.49 rows=45 width=713)"
"        Sort Key: g1.gid, (round((_st_distance((g1.geom)::geography, (g2.geom)::geography, '0'::double precision, true) * '0.000621371'::double precision)))"
"        ->  Nested Loop  (cost=0.15..11005.14 rows=45 width=713)"
"              ->  Seq Scan on parcels g1  (cost=0.00..1692.01 rows=18001 width=653)"
"              ->  Index Scan using powerplants_us_201603_gix on powerplants_us_201603 g2  (cost=0.15..0.51 rows=1 width=60)"
"                    Index Cond: (geom && st_expand(g1.geom, '160934'::double precision))"
"                    Filter: ((g1.geom && st_expand(geom, '160934'::double precision)) AND _st_dwithin(g1.geom, geom, '160934'::double precision))"

Based on the comment below I've updated the ST_DWithin function to use the geography type:

SELECT DISTINCT ON(g1.gid) g1.*, g2.plant_name As closest_utility_name,
g2.primsource As closest_utility_type,
round(ST_Distance(g1.geom::geography,g2.geom::geography) * 0.000621371) As closest_utility_miles
INTO parcels_utility
FROM parcels As g1, powerplants_us_201603 As g2
WHERE ST_DWithin(g1.geom::geography, g2.geom::geography, 160934)
ORDER BY g1.gid, closest_utility_miles;

It has been running now for about an hour, the query plan is below:

"Unique  (cost=612038648.76..612038650.68 rows=200 width=455)"
"  ->  Sort  (cost=612038648.76..612038649.72 rows=383 width=455)"
"        Sort Key: g1.gid, (round((_st_distance((g1.geom)::geography, (g2.geom)::geography, '0'::double precision, true) * '0.000621371'::double precision)))"
"        ->  Nested Loop  (cost=0.00..612038632.33 rows=383 width=455)"
"              Join Filter: (((g1.geom)::geography && _st_expand((g2.geom)::geography, '160934'::double precision)) AND ((g2.geom)::geography && _st_expand((g1.geom)::geography, '160934'::double precision)) AND _st_dwithin((g1.geom)::geography, (g2.geom)::g (...)"
"              ->  Seq Scan on parcels g1  (cost=0.00..20630.55 rows=152755 width=395)"
"              ->  Materialize  (cost=0.00..430.86 rows=7524 width=60)"
"                    ->  Seq Scan on powerplants_us_201603 g2  (cost=0.00..393.24 rows=7524 width=60)"

Best Answer

You're mixing units in your original attempt, and improperly using geography in your second. I'll provide a geography solution, though if your data are local the very highest performance will be using planar (not 4326) geometry.

-- Change the data type in the tables
ALTER TABLE powerplants_us_201603 ALTER COLUMN geom TYPE geography USING geography(geom);
ALTER TABLE parcels ALTER COLUMN geom TYPE geography USING geography(geom);
ALTER TABLE powerplants_us_201603 RENAME COLUMN geom TO geog;
ALTER TABLE parcels RENAME COLUMN geom TO geog;

-- Build real geography indexes
CREATE INDEX parcels_geog_x ON parcels USING GIST (geog);
CREATE INDEX powerplants_geog_x ON powerplants_us_201603 USING GIST (geog);

-- Run the (simpler now) query
SELECT DISTINCT ON (g1.gid)
  g1.*, 
  g2.plant_name AS closest_utility_name,
  g2.primsource AS closest_utility_type,
  round(ST_Distance(g1.geog, g2.geog) / 1609.34) AS closest_utility_miles
FROM parcels AS g1, 
JOIN powerplants_us_201603 AS g2
ON ST_DWithin(g1.geog, g2.geog, 100 * 1609.34)
ORDER BY g1.gid, closest_utility_miles;

Related Solutions

[GIS] Improve performance of a PostGIS st_dwithin query

The explain doesn't show an index coming into play, which could be for two reasons:

You don't have one. So make one with CREATE INDEX tree_gix ON trees USING GIST (geom)
Your data is in geographic coordinates, so your spatial join isn't really doing anything selective (it's joining every tree to all other trees, every time). In that case, either (a) change to using the geography type or (b) move your data to an appropriate planar projection (I recommend (b)).

Finally, yeah, the LEFT JOIN is not doing anything useful for you unless there are lonely trees with no partners in the radius you need to keep in the result set. I'd remove the geometric and species GROUP BY as well, since you already have a unique id in there, the other variables are just noise.

SELECT 
  a.tree_id, a.species, avg(b.age) as age_avg, 
  count(*) as samples, a.geom
FROM trees a 
JOIN trees b
ON ST_DWithin(a.geom, b.geom, 100) AND a.species = b.species
WHERE a.age IS NULL
GROUP BY a.tree_id;

Other than the first two possible errors above, the query itself looks pretty neat and clean.

[GIS] How to use St_intersects with different geometry type

Fast query results for ST_Intersects hinge on the fact that not every pair of inputs needs to be tested. PostGIS avoids testing every pair of geometries by implicitly testing the arguments to ST_Intersects with the bounding box intersection operator &&, so that only geometries whose bounding boxes intersect need to be passed to ST_Intersects. When your geometry columns are indexed, PostgreSQL can use the index to fetch only geometries that pass the && filter, significantly reducing the number of comparisons.

Here's the problem. The index provides the bounding boxes of a.geom and b.geom, but not ST_Centroid(a.geom). You and I know that whenever a.geom && b.geom is true, then ST_Centroid(a.geom) && b.geom must also be true, but PostgreSQL has no way to know this.

You can fix this by manually forcing an a.geom && b.geom comparison, which can take advantage of the index.

SELECT a.id, b.id 
FROM PolygonLayer1 a, PolygonLayer2 b
WHERE a.geom && b.geom AND ST_Intersects(ST_Centroid(a.geom), b.geom)

This doesn't explain why you're getting good performance in Case 1, because I have no idea.

Best Answer

Related Solutions

[GIS] Improve performance of a PostGIS st_dwithin query

[GIS] How to use St_intersects with different geometry type

Related Question