[GIS] When to use GiST and when to use SP-GiST index

postgispostgresqlspatial-index

I'm aware of the documentation for both of the indexes, but can't decide which one to use.

Example:

Let's say we have table data with the POINT attribute that contains coordinates.
And let's also say we want to preview all points on the map to create a marker clustering. (Only inside of the boundaries of data view). As in the figure below.

This data is not balanced I suppose? what does that mean?

So why to use GiST or SP-GiST?

Best Answer

An exact answer is not possible without testing on your actual data, as this is highly dependent on the data size, structure, corresponding PostgreSQL internals, and the exact queries you intent to use the index for.

Generally speaking, it is perfectly fine to assume a GIST index is what you want.

Some basic considerations:

GIST has full operator support, including (k)NN searches
SP-GIST doesn't support (k)NN as of yet, and supports fewer operators (which is probably not a real issue, though)
GIST isn't overly sensitive to the spatial distribution (homogeneous/consistently spaced vs heterogeneous/blobs of geometries) and the topology (many overlaps vs isolated distribution) of your geometries
SP_GIST is most effective for non-overlapping geometries, and boost searches for spatially homogeneous distributions, due to its Spatial Partitioning
GIST creation time is rising slightly non-linear with the amount of data it has to ingest, but has a an overall stable increase (ballpark figure: 20 minutes for 100 million rows (points; global distribution))
SP_GIST is likely faster for smaller amount of data, but tends to have a significant performance drop after a few hundred million geometries compared to GIST
GIST indexes have a non-trivial storage impact (ballpark figure: 5GB for 100 million geometries), but only BRIN indexes really make a difference here
SP-GIST has a few percentages less space requirement

However:

Since it seems you are having homogeneously distributed POINTs, you could definitely try the SP_GIST index and see if you get more performance out of it; this is still dependent on other factors that are linked to PG internals, relation and result statistics, though.

But it will likely be slower if all you filter for is all points within a large bbox, as this is better covered with GIST.

Related Solutions

[GIS] Why does ST_Area(g.geom) + ST_Area(p.geom) – ST_Area(ST_Union(g.geom,p.geom)) produce negative result when using ST_Overlaps(g.geom, p.geom) = true

In PostGIS terms, ST_Overlaps(g1, g2) means that the two geometries intersect but neither totally contains the other. If, as you mentioned, many of the poa polygons are totally contained within the gov polygons, they will not be included in the ST_Overlaps() function. ST_Union() on the other hand, just combines all areas, removing intersections. So the area of a union in the total area of both geometries minus intersections. This could certainly be greater than the overlap.

A simple example: if gov contains one large polygon, and poa contains one smaller polygon, totally enclosed in gov, then Overlaps(gov, pos) is false, so it's area would not be included in your query, but ST_Area(ST_Union()) would return the area of gov which is > 0, thus the negative result.

[GIS] PostGIS Intermittent INDEX Performance

Effectively forcing the planner to do the thing you want might help. In this case, sub-setting the polygon table prior to executing the spatial join with the points table. You might be able to outwit the planner using "WITH" syntax:

WITH polys AS (
  SELECT * FROM area
  WHERE area.id in IN(28,29,30,31,32,33,25,26,27,18,19,20,21,12,13,14,15,16,17,34,35,1,2,3,4,5,6,22,23,24,7,8,9,10,11)
)
SELECT * 
FROM polys JOIN data ON ST_Intersects(data.point, polys.polygon)
WHERE datasetid IN(3) 
AND (readingdatetime BETWEEN '1948-01-01' AND '2012-11-19');

The trouble with trying to play these games is that you are coding into your statement the assumption "my polygon list will always be more selective than my other query portions". Which might not be true for all parameterizations of your query, or for all applications of a particular query over a heterogeneously distributed dataset.

But it might work.

UPDATE: This goes even further down the dangerous road of assuming you know the selectivity of your clauses beforehand, this time we also take the attribute selection on the point table out and do it separately before the spatial join:

WITH polys AS (
  SELECT * FROM area
  WHERE area.id in IN(28,29,30,31,32,33,25,26,27,18,19,20,21,12,13,14,15,16,17,34,35,1,2,3,4,5,6,22,23,24,7,8,9,10,11)
),
WITH points AS (
  SELECT * FROM data
  WHERE datasetid IN(3) 
  AND (readingdatetime BETWEEN '1948-01-01' AND '2012-11-19')
)
SELECT * 
FROM polys JOIN points ON ST_Intersects(points, polys.polygon);

Best Answer

Related Solutions

[GIS] Why does ST_Area(g.geom) + ST_Area(p.geom) – ST_Area(ST_Union(g.geom,p.geom)) produce negative result when using ST_Overlaps(g.geom, p.geom) = true

[GIS] PostGIS Intermittent INDEX Performance

Related Question