[GIS] Improve performance of a PostGIS st_dwithin query

interpolationperformancepostgis

I am doing a local statistic, similar to the the one described in an earlier question / answer:

SELECT a.tree_id, a.species, avg(b.age) as age_avg, count(*) as samples, a.geom
FROM trees a LEFT JOIN trees b
ON ST_DWithin(a.geom, b.geom, 100) AND a.species = b.species
WHERE a.age IS NULL
GROUP BY a.tree_id, a.species, a.geom;

This finds all trees of the same species in a radius of 100 meters. This works quite nicely for small datasets with few missing data points (WHERE a.age IS NULL).

However, when I run the query for a larger dataset with more missing data, it gets very slow (i.e. several hours / days). In this case 6000 of of a total of 200000 points have no value (a.age).

Do you see a way to increase the speed of the query? Maybe an alternative function to st_dwithin is helpful?

UPDATE – EXPLAIN ANALYZE says:

HashAggregate (cost=2721694211.11..2721694280.52 rows=6941 width=26) (actual time=12571.761..12571.774 rows=8 loops=1)

Group Key: a.id, a.age, a.species

-> Nested Loop Left Join (cost=0.00..2721694141.70 rows=6941 width=26) (actual time=655.504..12570.495 rows=167 loops=1)
Join Filter: ((a.geom && st_expand(b.geom, 300::double precision)) AND (b.geom && st_expand(a.geom, 300::double precision)) AND _st_dwithin(a.geom, b.geom, 300::double precision))
Rows Removed by Join Filter: 11210316

-> Seq Scan on trees a (cost=0.00..251349.76 rows=6941 width=54) (actual time=213.037..1006.055 rows=8 loops=1)
Filter: ((age IS NULL))
Rows Removed by Filter: 1401302

-> Materialize (cost=0.00..254946.52 rows=1438701 width=36) (actual time=0.004..326.335 rows=1401310 loops=8)

-> Seq Scan on trees b (cost=0.00..247753.01 rows=1438701 width=36) (actual time=0.011..1490.796 rows=1401310 loops=1)

Planning time: 0.186 ms
Execution time: 12597.034 ms

Best Answer

The explain doesn't show an index coming into play, which could be for two reasons:

You don't have one. So make one with CREATE INDEX tree_gix ON trees USING GIST (geom)
Your data is in geographic coordinates, so your spatial join isn't really doing anything selective (it's joining every tree to all other trees, every time). In that case, either (a) change to using the geography type or (b) move your data to an appropriate planar projection (I recommend (b)).

Finally, yeah, the LEFT JOIN is not doing anything useful for you unless there are lonely trees with no partners in the radius you need to keep in the result set. I'd remove the geometric and species GROUP BY as well, since you already have a unique id in there, the other variables are just noise.

SELECT 
  a.tree_id, a.species, avg(b.age) as age_avg, 
  count(*) as samples, a.geom
FROM trees a 
JOIN trees b
ON ST_DWithin(a.geom, b.geom, 100) AND a.species = b.species
WHERE a.age IS NULL
GROUP BY a.tree_id;

Other than the first two possible errors above, the query itself looks pretty neat and clean.

Related Solutions

[GIS] Improve performance on a st_dwithin query (in PostGIS)

Did you create spatial indexes for both tables and CLUSTER on those? In my experience this really speeds up these kinds of queries and the CLUSTER part is often neglected by users and tutorials, etc.

CREATE INDEX line_2010_index ON line USING GIST (geom); 
CLUSTER line USING line_index;

CREATE INDEX raw_2015_index ON raw_2015 USING GIST (the_geom); 
CLUSTER raw_2015 USING raw_2015_index;

[GIS] How to use St_intersects with different geometry type

Fast query results for ST_Intersects hinge on the fact that not every pair of inputs needs to be tested. PostGIS avoids testing every pair of geometries by implicitly testing the arguments to ST_Intersects with the bounding box intersection operator &&, so that only geometries whose bounding boxes intersect need to be passed to ST_Intersects. When your geometry columns are indexed, PostgreSQL can use the index to fetch only geometries that pass the && filter, significantly reducing the number of comparisons.

Here's the problem. The index provides the bounding boxes of a.geom and b.geom, but not ST_Centroid(a.geom). You and I know that whenever a.geom && b.geom is true, then ST_Centroid(a.geom) && b.geom must also be true, but PostgreSQL has no way to know this.

You can fix this by manually forcing an a.geom && b.geom comparison, which can take advantage of the index.

SELECT a.id, b.id 
FROM PolygonLayer1 a, PolygonLayer2 b
WHERE a.geom && b.geom AND ST_Intersects(ST_Centroid(a.geom), b.geom)

This doesn't explain why you're getting good performance in Case 1, because I have no idea.

Best Answer

Related Solutions

[GIS] Improve performance on a st_dwithin query (in PostGIS)

[GIS] How to use St_intersects with different geometry type

Related Question