[GIS] Improve performance on a st_dwithin query (in PostGIS)

postgis st-dwithin

At the moment I am working on a query, as described in an earlier question. I have two tables:

testme (tracked GPS-profile with Point Geometry) and
roads (geometry roads shapefile)

Besides the distance from each tracked point (a row in the table aliased as pt) to the nearest road, I want to find the closest point on that road, using a combination of ST_Distance and ST_DWithin.

DROP TABLE IF EXISTS raw_2015_processed;
EXPLAIN ANALYZE
CREATE TABLE raw_2015_processed AS
SELECT pt.id, 
    pt."DeliveryID",
    pt."VehicleID",
    pt."TrackID",
    pt."Longitude",
    pt."Latitude",
    pt."Altitude",
    pt."Heading",
    pt."Speed",
    pt."Satelites",
    pt."HDOP",
    pt."VDOP",
    pt."Xfcd",
    pt.ts,
    pt.received,
    pt.the_geom,
(SELECT ST_ClosestPoint(line.geom,pt.the_geom) AS closest_geom
    FROM roads AS line 
    WHERE ST_DWithin(line.geom,pt.the_geom, 0.5) LIMIT 1),
(SELECT ST_Distance(line.geom,pt.the_geom) AS distance
    FROM roads AS line
    ORDER BY pt.the_geom <#> line.geom LIMIT 1),
(SELECT ST_AsText(ST_ClosestPoint(line.geom, pt.the_geom)) AS closest_coordinates
    FROM roads AS line
    ORDER BY pt.the_geom <#> line.geom LIMIT 1)
FROM raw_2015 AS pt
ORDER BY pt.id;

Working with a reduced table 'testme' of 144 rows, EXPLAIN ANALYZE returns the following query plan:

"Sort  (cost=84207.04..84207.40 rows=144 width=131) (actual time=11797.677..11797.685 rows=144 loops=1)"
"  Sort Key: pt.id"
"  Sort Method: quicksort  Memory: 54kB"
"  ->  Seq Scan on testme pt  (cost=0.00..84201.87 rows=144 width=131) (actual time=82.458..11797.268 rows=144 loops=1)"
"        SubPlan 1"
"          ->  Limit  (cost=9.46..578.16 rows=1 width=155) (actual time=81.508..81.508 rows=1 loops=144)"
"                ->  Bitmap Heap Scan on roads line  (cost=9.46..578.16 rows=1 width=155) (actual time=81.388..81.388 rows=1 loops=144)"
"                      Recheck Cond: (geom && st_expand(pt.the_geom, 0.5::double precision))"
"                      Rows Removed by Index Recheck: 0"
"                      Filter: ((pt.the_geom && st_expand(geom, 0.5::double precision)) AND _st_dwithin(geom, pt.the_geom, 0.5::double precision))"
"                      ->  Bitmap Index Scan on geom_index_roads  (cost=0.00..9.46 rows=139 width=0) (actual time=79.722..79.722 rows=450402 loops=144)"
"                            Index Cond: (geom && st_expand(pt.the_geom, 0.5::double precision))"
"        SubPlan 2"

However, when I run the query on a larger dataset with more data points (< 5 million points), it gets very slow (i.e. several hours or days). Do you see a way to speed up the query? Is there an alternative to ST_DWithin, or a different query structure, that could prove helpful?

Best Answer

Did you create spatial indexes for both tables and CLUSTER on those? In my experience this really speeds up these kinds of queries, and the CLUSTER step is often neglected by users and tutorials.

CREATE INDEX roads_geom_index ON roads USING GIST (geom);
CLUSTER roads USING roads_geom_index;

CREATE INDEX raw_2015_index ON raw_2015 USING GIST (the_geom);
CLUSTER raw_2015 USING raw_2015_index;
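
A further option, sketched here rather than taken from the answer above, is to restructure the query itself. The plan in the question shows the index scan behind ST_DWithin returning 450402 candidate rows for each of the 144 points, and the query looks up the nearest road three separate times per point. The sketch below does a single nearest-neighbour lookup per point with a LATERAL subquery and reuses the result. It swaps the question's `<#>` (bounding-box distance) for `<->`, which on recent PostGIS versions orders by the distance between the geometries themselves; whether that is available and index-assisted in your installation is an assumption to verify with EXPLAIN. Unlike the original first subquery, it applies no 0.5 cutoff, so every point gets its nearest road regardless of distance.

```sql
-- Sketch: one index-assisted nearest-road lookup per point,
-- reused for all three derived columns.
-- Add the remaining pt."..." columns from the original query as needed.
SELECT pt.id,
       pt.the_geom,
       ST_ClosestPoint(nearest.geom, pt.the_geom)            AS closest_geom,
       ST_Distance(nearest.geom, pt.the_geom)                AS distance,
       ST_AsText(ST_ClosestPoint(nearest.geom, pt.the_geom)) AS closest_coordinates
FROM raw_2015 AS pt
CROSS JOIN LATERAL (
    -- KNN ordering on the indexed column; LIMIT 1 keeps only the nearest road
    SELECT line.geom
    FROM roads AS line
    ORDER BY line.geom <-> pt.the_geom
    LIMIT 1
) AS nearest
ORDER BY pt.id;
```

To materialize the result as in the question, the same SELECT can be placed after CREATE TABLE raw_2015_processed AS. If points far from any road should get no match, a WHERE clause on the computed distance can be added in the outer query.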

Related Solutions

[GIS] Improve performance of a PostGIS st_dwithin query

The explain doesn't show an index coming into play, which could be for two reasons:

You don't have one. So make one with CREATE INDEX tree_gix ON trees USING GIST (geom)
Your data is in geographic coordinates, so your spatial join isn't really doing anything selective (it's joining every tree to all other trees, every time). In that case, either (a) change to using the geography type or (b) move your data to an appropriate planar projection (I recommend (b)).
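
The two options (a) and (b) might look roughly as follows. This is a sketch under assumptions: the table and column names come from the query in this answer, the index name trees_geog_gix is hypothetical, the geometry type is assumed to be Point, and EPSG:32633 is only a placeholder for whatever planar projection suits your area. Both variants assume the column already has its correct SRID set.

```sql
-- (a) Keep lon/lat data and compare as geography, so 100 means metres.
-- The expression index must match the cast used in the query.
CREATE INDEX trees_geog_gix ON trees USING GIST ((geom::geography));

SELECT a.tree_id, count(*) AS samples
FROM trees a
JOIN trees b
  ON ST_DWithin(a.geom::geography, b.geom::geography, 100)
GROUP BY a.tree_id;

-- (b) Alternatively, reproject the column once to a planar CRS
-- (placeholder SRID), then index the projected geometry.
ALTER TABLE trees
  ALTER COLUMN geom TYPE geometry(Point, 32633)
  USING ST_Transform(geom, 32633);

CREATE INDEX tree_gix ON trees USING GIST (geom);
```

Pick one of the two; after (b) the distance argument of ST_DWithin is in the units of the chosen projection.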

Finally, yeah, the LEFT JOIN is not doing anything useful for you unless there are lonely trees with no partners in the radius that you need to keep in the result set. I'd remove the geometry and species columns from the GROUP BY as well: since you already have a unique id in there, the other variables are just noise.

SELECT 
  a.tree_id, a.species, avg(b.age) as age_avg, 
  count(*) as samples, a.geom
FROM trees a 
JOIN trees b
ON ST_DWithin(a.geom, b.geom, 100) AND a.species = b.species
WHERE a.age IS NULL
GROUP BY a.tree_id;

Other than the first two possible errors above, the query itself looks pretty neat and clean.

[GIS] How to use St_intersects with different geometry type

Fast query results for ST_Intersects hinge on the fact that not every pair of inputs needs to be tested. PostGIS avoids testing every pair of geometries by implicitly testing the arguments to ST_Intersects with the bounding box intersection operator &&, so that only geometries whose bounding boxes intersect need to be passed to ST_Intersects. When your geometry columns are indexed, PostgreSQL can use the index to fetch only geometries that pass the && filter, significantly reducing the number of comparisons.

Here's the problem. The index provides the bounding boxes of a.geom and b.geom, but not of ST_Centroid(a.geom). You and I know that whenever ST_Centroid(a.geom) && b.geom is true, a.geom && b.geom must also be true (the centroid always lies inside its geometry's bounding box), but PostgreSQL has no way to know this.

You can fix this by manually forcing an a.geom && b.geom comparison, which can take advantage of the index.

SELECT a.id, b.id 
FROM PolygonLayer1 a, PolygonLayer2 b
WHERE a.geom && b.geom AND ST_Intersects(ST_Centroid(a.geom), b.geom)
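
To check that the extra && condition really lets the planner use an index, one could create the indexes and inspect the plan. This is a sketch: the index names are hypothetical, and the exact plan nodes depend on your data and PostGIS version.

```sql
-- Hypothetical index names; skip these if the indexes already exist.
CREATE INDEX polygonlayer1_geom_gix ON PolygonLayer1 USING GIST (geom);
CREATE INDEX polygonlayer2_geom_gix ON PolygonLayer2 USING GIST (geom);

-- Look for an index scan on one of the geom indexes with a condition on &&.
EXPLAIN
SELECT a.id, b.id
FROM PolygonLayer1 a, PolygonLayer2 b
WHERE a.geom && b.geom
  AND ST_Intersects(ST_Centroid(a.geom), b.geom);
```

If the plan still shows only sequential scans with a nested loop, the index is not being used and the statistics or index definitions are worth checking.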

This doesn't explain why you're getting good performance in Case 1, because I have no idea.
