Actual PostGIS query execution time MUCH slower than expected

performancepostgispostgresqlspatial-indexsql

I've run into a performance problem from which I'm not sure how to move forward. I have a table with ~150M rows. And I have a query that in EXPLAIN has a cost of just cost=0.55..209.36, but in EXPLAIN ANALYZE the actual time is time=221.710..3538323.420 (it takes almost an hour to run!).

The table is more or less:

   Column   |            Type             | Collation | Nullable | Default
------------+-----------------------------+-----------+----------+---------
 id         | bytea                       |           | not null |
 some_id    | bytea                       |           | not null |
 other_id   | bytea                       |           | not null |
 geometry   | geometry                    |           | not null |
 deleted_at | timestamp without time zone |           |          |
 ...other columns
Indexes:
    "model_id" PRIMARY KEY, btree (id)
    "model_some_id_paging" btree (some_id, id DESC) WHERE deleted_at IS NULL
    "model_other_id" btree (other_id) WHERE deleted_at IS NULL
    "model_geometry" gist (geometry) WHERE deleted_at IS NULL
    ...other indexes
Check constraints:
    "model_valid_geometry" CHECK (st_isvalid(geometry))
Referenced by:
    ...references by one other table

And the query is:

SELECT * FROM model
WHERE deleted_at IS NULL
  AND ST_Intersects(geometry, $1)
  AND other_id IN ($2) -- single value, so equivalent to other_id = $2
  AND some_id = $3;

As expected, the query planner uses the partial index on geometry, results of EXPLAIN:

 Index Scan using model_geometry on model  (cost=0.55..209.36 rows=15 width=424)
   Index Cond: (geometry && '0103000020E61000000100000005000000883E0E6F224A5EC058ACBD2F91A64340883E0E60044A5EC058ACBD2F91A64340883E0E60044A5EC0BC5459C0ADA64340883E0E6F224A5EC0BC5459C0ADA64340883E0E6F224A5EC058ACBD2F91A64340'::geometry)
   Filter: ((other_id = '...'::bytea) AND (some_id = '...'::bytea) AND _st_intersects(geometry, '0103000020E61000000100000005000000883E0E6F224A5EC058ACBD2F91A64340883E0E60044A5EC058ACBD2F91A64340883E0E60044A5EC0BC5459C0ADA64340883E0E6F224A5EC0BC5459C0ADA64340883E0E6F224A5EC058ACBD2F91A64340'::geometry))

(the geometry is not very large)

And the output of EXPLAIN ANALYZE (which takes almost an hour to execute):

 Index Scan using model_geometry on model  (cost=0.55..209.36 rows=15 width=424) (actual time=221.710..3538323.420 rows=290 loops=1)
   Index Cond: (geometry && '0103000020E61000000100000005000000883E0E6F224A5EC058ACBD2F91A64340883E0E60044A5EC058ACBD2F91A64340883E0E60044A5EC0BC5459C0ADA64340883E0E6F224A5EC0BC5459C0ADA64340883E0E6F224A5EC058ACBD2F91A64340'::geometry)
   Filter: ((other_id = '...'::bytea) AND (some_id = '...'::bytea) AND _st_intersects(geometry, '0103000020E61000000100000005000000883E0E6F224A5EC058ACBD2F91A64340883E0E60044A5EC058ACBD2F91A64340883E0E60044A5EC0BC5459C0ADA64340883E0E6F224A5EC0BC5459C0ADA64340883E0E6F224A5EC058ACBD2F91A64340'::geometry))
   Rows Removed by Filter: 12
 Planning Time: 0.297 ms
 Execution Time: 3538324.292 ms

The geometries stored in the model table are all small. ~61% are points, ~37% polygons, ~2% line strings, and the rest are multipolygons. The model table itself is 119 GB and the model_geometry index is 43 GB. I'm using AWS Aurora compatible with Postgres 11.12 and PostGIS 2.4.4.

AWS Performance insights shows that this query spends time mostly on IO:DataFileRead, which is:

A session is reading data from Aurora storage. This may be a typical wait event for I/O intensive workloads. SQL statements showing a comparatively large proportion of this wait event compared to other SQL statements may be using an inefficient query plan that requires reading large amounts of data.

Other things to note: rows with deleted_at IS NULL constitute over 99% of all the rows, and out of those almost 96.5% of rows have the requested some_id and other_id. I.e. the most selective condition is the one on geometry.

I have run vacuum verbose analyze model, but it didn't do much. I'm planning to cluster the table on geohashes of the geometry table (https://postgis.net/workshops/postgis-intro/clusterindex.html#clustering-on-geohash), but I'm not sure what else to do.

Edit: The geometries are not line strings. They are mostly points and polygons. Updated the description above with more details about the geometry types.

Best Answer

I have clustered the column on the geohash of the geometry, and it has brought the execution down to milliseconds! I used the technique referenced in my question: https://postgis.net/workshops/postgis-intro/clusterindex.html#clustering-on-geohash (So far I have tried it on an exact replica of the database, not the live system, but that shouldn't affect anything.)

Edit: what I have done eventually is that I run:

ALTER TABLE model CLUSTER ON model_geometry_geohash;

And then run pg_repack on it. It solved the problem in less time than CLUSTER, and without locking the whole table.

Related Solutions

[GIS] How to use St_intersects with different geometry type

Fast query results for ST_Intersects hinge on the fact that not every pair of inputs needs to be tested. PostGIS avoids testing every pair of geometries by implicitly testing the arguments to ST_Intersects with the bounding box intersection operator &&, so that only geometries whose bounding boxes intersect need to be passed to ST_Intersects. When your geometry columns are indexed, PostgreSQL can use the index to fetch only geometries that pass the && filter, significantly reducing the number of comparisons.

Here's the problem. The index provides the bounding boxes of a.geom and b.geom, but not ST_Centroid(a.geom). You and I know that whenever a.geom && b.geom is true, then ST_Centroid(a.geom) && b.geom must also be true, but PostgreSQL has no way to know this.

You can fix this by manually forcing an a.geom && b.geom comparison, which can take advantage of the index.

SELECT a.id, b.id 
FROM PolygonLayer1 a, PolygonLayer2 b
WHERE a.geom && b.geom AND ST_Intersects(ST_Centroid(a.geom), b.geom)

This doesn't explain why you're getting good performance in Case 1, because I have no idea.

[GIS] PostGIS geography intersect slower than geometry intersect

As @LR1234567 pointed out, the geography version of ST_Intersects is really just a wrapper around a distance calculation. If your geographies contain a lot of vertices, you may get better performance by forcing PostGIS to use a tree-based distance calculation.

This algorithm is much faster than the brute-force distance calculation, but PostGIS makes the wager that it's only worth using it when the tree can be reused multiple times. You can override this logic and force the use of the tree-based calculation like this:

SELECT DISTINCT g1.iso, g2.iso FROM gadm g1, gadm g2
WHERE g1.geog && g2.geog 
  AND g1.iso!=g2.iso
  AND _ST_DistanceTree(g1.geog, g2.geog) < 0.00001

Best Answer

Related Solutions

[GIS] How to use St_intersects with different geometry type

[GIS] PostGIS geography intersect slower than geometry intersect

Related Question