[GIS] Improve performance of a PostGIS st_dwithin query

interpolationperformancepostgis

I am doing a local statistic, similar to the the one described in an earlier question / answer:

SELECT a.tree_id, a.species, avg(b.age) as age_avg, count(*) as samples, a.geom
FROM trees a LEFT JOIN trees b
ON ST_DWithin(a.geom, b.geom, 100) AND a.species = b.species
WHERE a.age IS NULL
GROUP BY a.tree_id, a.species, a.geom;

This finds all trees of the same species in a radius of 100 meters. This works quite nicely for small datasets with few missing data points (WHERE a.age IS NULL).

However, when I run the query for a larger dataset with more missing data, it gets very slow (i.e. several hours / days). In this case 6000 of of a total of 200000 points have no value (a.age).

Do you see a way to increase the speed of the query? Maybe an alternative function to st_dwithin is helpful?

UPDATEEXPLAIN ANALYZE says:

HashAggregate (cost=2721694211.11..2721694280.52 rows=6941 width=26) (actual time=12571.761..12571.774 rows=8 loops=1)

Group Key: a.id, a.age, a.species

-> Nested Loop Left Join (cost=0.00..2721694141.70 rows=6941 width=26) (actual time=655.504..12570.495 rows=167 loops=1)
Join Filter: ((a.geom && st_expand(b.geom, 300::double precision)) AND (b.geom && st_expand(a.geom, 300::double precision)) AND _st_dwithin(a.geom, b.geom, 300::double precision))
Rows Removed by Join Filter: 11210316

-> Seq Scan on trees a (cost=0.00..251349.76 rows=6941 width=54) (actual time=213.037..1006.055 rows=8 loops=1)
Filter: ((age IS NULL))
Rows Removed by Filter: 1401302

-> Materialize (cost=0.00..254946.52 rows=1438701 width=36) (actual time=0.004..326.335 rows=1401310 loops=8)

-> Seq Scan on trees b (cost=0.00..247753.01 rows=1438701 width=36) (actual time=0.011..1490.796 rows=1401310 loops=1)

Planning time: 0.186 ms
Execution time: 12597.034 ms

Best Answer

The explain doesn't show an index coming into play, which could be for two reasons:

  • You don't have one. So make one with CREATE INDEX tree_gix ON trees USING GIST (geom)
  • Your data is in geographic coordinates, so your spatial join isn't really doing anything selective (it's joining every tree to all other trees, every time). In that case, either (a) change to using the geography type or (b) move your data to an appropriate planar projection (I recommend (b)).

Finally, yeah, the LEFT JOIN is not doing anything useful for you unless there are lonely trees with no partners in the radius you need to keep in the result set. I'd remove the geometric and species GROUP BY as well, since you already have a unique id in there, the other variables are just noise.

SELECT 
  a.tree_id, a.species, avg(b.age) as age_avg, 
  count(*) as samples, a.geom
FROM trees a 
JOIN trees b
ON ST_DWithin(a.geom, b.geom, 100) AND a.species = b.species
WHERE a.age IS NULL
GROUP BY a.tree_id;

Other than the first two possible errors above, the query itself looks pretty neat and clean.

Related Question