[GIS] Postgis performance increase with cluster

postgispostgresql

I've been reading a few questions on this site and saw quite a few people advocating for creating a cluster function.

So I've decided to test this function to see if there was any notable speed increase. I've used a table with 35,000 points. Both tables have the spatial index but tableb has the cluster

create table tableb as select * from tablea

CREATE INDEX tableb_sidx ON tableb  USING gist  (geom);

CLUSTER tableb USING tableb_sidx

select a.*
from tablea a, inventory b
where st_dwithin(a.geom, b.geom, 200)

select a.*
from tableb a, inventory b
where st_dwithin(a.geom, b.geom, 200)

Running these st_dwithin queries result in the exact same running time of 1 minute

Am I doing something wrong or does this cluster not apply to every spatial query?

Best Answer

According to the research I've done in the past, the effects of spatial fragmentation are measurable at 10k rows, barely perceptible at 100k rows, noticeable at 1m rows, and pronounced at 10m rows. These were worst case models, with randomly distributed data, and 2-5% return sets, so it's likely that the impact of barely fragmented data or with very small return volume would not be noticible without extensive testing (sample sizes large enough to show statistically significant variance).

It seems likely that your performance can be improved, but the issue is not spatial fragmentation. You should defragment both tables(on an attribute index which is spatially correlated, like zipcode or province) and run VACUUM FULL ANALYZE before further testing. After that you should use EXPLAIN to determine the query plan and move forward from there.