[GIS] Postgis performance increase with cluster

postgispostgresql

I've been reading a few questions on this site and saw quite a few people advocating for creating a cluster function.

So I've decided to test this function to see if there was any notable speed increase. I've used a table with 35,000 points. Both tables have the spatial index but tableb has the cluster

create table tableb as select * from tablea

CREATE INDEX tableb_sidx ON tableb  USING gist  (geom);

CLUSTER tableb USING tableb_sidx

select a.*
from tablea a, inventory b
where st_dwithin(a.geom, b.geom, 200)

select a.*
from tableb a, inventory b
where st_dwithin(a.geom, b.geom, 200)

Running these st_dwithin queries result in the exact same running time of 1 minute

Am I doing something wrong or does this cluster not apply to every spatial query?

Best Answer

According to the research I've done in the past, the effects of spatial fragmentation are measurable at 10k rows, barely perceptible at 100k rows, noticeable at 1m rows, and pronounced at 10m rows. These were worst case models, with randomly distributed data, and 2-5% return sets, so it's likely that the impact of barely fragmented data or with very small return volume would not be noticible without extensive testing (sample sizes large enough to show statistically significant variance).

It seems likely that your performance can be improved, but the issue is not spatial fragmentation. You should defragment both tables(on an attribute index which is spatially correlated, like zipcode or province) and run VACUUM FULL ANALYZE before further testing. After that you should use EXPLAIN to determine the query plan and move forward from there.

Related Solutions

[GIS] PostGIS Intermittent INDEX Performance

Effectively forcing the planner to do the thing you want might help. In this case, sub-setting the polygon table prior to executing the spatial join with the points table. You might be able to outwit the planner using "WITH" syntax:

WITH polys AS (
  SELECT * FROM area
  WHERE area.id in IN(28,29,30,31,32,33,25,26,27,18,19,20,21,12,13,14,15,16,17,34,35,1,2,3,4,5,6,22,23,24,7,8,9,10,11)
)
SELECT * 
FROM polys JOIN data ON ST_Intersects(data.point, polys.polygon)
WHERE datasetid IN(3) 
AND (readingdatetime BETWEEN '1948-01-01' AND '2012-11-19');

The trouble with trying to play these games is that you are coding into your statement the assumption "my polygon list will always be more selective than my other query portions". Which might not be true for all parameterizations of your query, or for all applications of a particular query over a heterogeneously distributed dataset.

But it might work.

UPDATE: This goes even further down the dangerous road of assuming you know the selectivity of your clauses beforehand, this time we also take the attribute selection on the point table out and do it separately before the spatial join:

WITH polys AS (
  SELECT * FROM area
  WHERE area.id in IN(28,29,30,31,32,33,25,26,27,18,19,20,21,12,13,14,15,16,17,34,35,1,2,3,4,5,6,22,23,24,7,8,9,10,11)
),
WITH points AS (
  SELECT * FROM data
  WHERE datasetid IN(3) 
  AND (readingdatetime BETWEEN '1948-01-01' AND '2012-11-19')
)
SELECT * 
FROM polys JOIN points ON ST_Intersects(points, polys.polygon);

[GIS] PostGIS – ST_Within or ST_Disjoint performance issues

Ironically, the fastest way to find the set of things not within other things is to do a full join that finds the contained things, but using a LEFT JOIN, so the un-matched things are hanging about to be found, thus:

SELECT pts.*
FROM pts LEFT JOIN polys
ON ST_Contains(polys.geom, pts.geom)
WHERE polys.id IS NULL;

The un-matched rows in a left join are returned as NULL so doing an IS NULL test on a column that you know is declared NOT NULL finds you all the un-matched rows.

Best Answer

Related Solutions

[GIS] PostGIS Intermittent INDEX Performance

[GIS] PostGIS – ST_Within or ST_Disjoint performance issues

Related Question