[GIS] Improving the speed of a Postgis spatial query

optimizationpostgispostgresqlspatial-query

I have a query that is slow on PostgreSQL 9 + PostGIS (running on a Linux computer), despite my efforts to improve its speed.

I basically have a table that contains Polygons called polygones_zones. Actually these polygons come from MultiPolygons that represent land areas broken down into mere polygons in a previous step using the PostGIS function ST_Dump() – the reason for doing this was to improve the speed of the query.
This table has about 330,000 rows.

On the other hand, I have a table with Points called sites. This table has about 290,000 rows but when running the query, I currently use only a ~4000 sample of them before using the spatial joint.

I want to know in which areas (the original MultiPolygons) the Points are contained (Postgis function ST_Contains() ).

Everything has a 2D geometry : the Polygons were created using ST_Force2D and on the table with the points, there is a constraint "CHECK (ST_NDims(geom) = 2)". All geometries have been controlled as valid.

I have a GIST index on the geometry columns of both tables and have run a VACUUM ANALYZE and a TABLE REINDEX on both of them just before running the query.

The table definitions :

CREATE TABLE polygones_zones (
    pkid serial PRIMARY KEY,
    category varchar(50) NOT NULL,
    value varchar (255),
    origine varchar(255) NOT NULL,
    origine_id integer,
    geom geometry(Polygon, 2154) 
    CONSTRAINT enforce_dims_geom CHECK (ST_NDims(geom) = 2)
);
CREATE INDEX polygones_zones_geometry ON polygones_zones USING GIST(geom);
CREATE INDEX ON polygones_zones (origine, origine_id) ;

CREATE TABLE sites (
    pkid serial PRIMARY KEY,
    origine varchar(20) NOT NULL, 
    origine_id varchar(20) NOT NULL, 
    status varchar(255) NOT NULL,
    geom geometry(Point, 2154),
    CONSTRAINT enforce_dims_geom CHECK (ST_NDims(geom) = 2)
); 

CREATE UNIQUE INDEX ON sites (origine, origine_id ) ;
CREATE INDEX ON sites (status);
CREATE INDEX sites_geometry ON sites USING GIST ( geom );

And now the query :

SELECT 
        sites.pkid, 
        zones.origine ,
        zones.origine_id ,
        FALSE,
        zones.category,
        zones.value
        FROM 
        (
            SELECT pkid, geom
            FROM sites
            WHERE status = 'selected'
        ) AS sites
        JOIN 
        (
            SELECT DISTINCT ON (origine_id , origine) 
                category, value, geom, origine_id , origine 
                FROM polygones_zones 
        ) AS zones
            ON ST_Contains(zones.geom , sites.geom)
;

The reason for the "DISTINCT ON (origine_id, origine)" is that when the MultiPolygons have been broken down into Polygons, the identifiers of the original MultiPolygons have been recorder in these two fields. Hence, if a multipolygon has been broken into two overlapping polygons that both contain on of the points in the table sites , without the DISTINCT, the JOIN would return two rows instead of one. Using the DISTINCT ON ensures that for a single original MultiPolygon, the JOIN will only return one row.

Running this query on the sample of Points that is about 1/100 of my data takes about 30 mn. That would mean the query on the whole data set should take about 50 hours !
What can I do to improve its speed ?


Edit : after John Powell's pertinent comment, here is the result of EXPLAIN :

"Nested Loop  (cost=525286.40..547598.60 rows=24 width=40)"
"  Join Filter: ((polygones_zones.geom ~ sites.geom) AND _st_contains(polygones_zones.geom, sites.geom))"
"  ->  Index Scan using sites_statut_idx on sites  (cost=0.42..4.44 rows=1 width=36)"
"        Index Cond: ((statut)::text = 'selected'::text)"
"  ->  Unique  (cost=525285.98..527794.86 rows=72658 width=1599)"
"        ->  Sort  (cost=525285.98..526122.27 rows=334517 width=1599)"
"              Sort Key: polygones_zones.origine_id, polygones_zones.origine"
"              ->  Seq Scan on polygones_zones  (cost=0.00..30379.17 rows=334517 width=1599)"

Best Answer

Answer : after John Powell's comment, the EXPLAIN showed that it was the DISTINCT that was taking way too much time. It also made me notice that this DISTINCT was producing wrong results.

I have moved the DISTINCT clause so now the query runs in under 4s :

SELECT 
        DISTINCT ON (origine_id , origine) 
        sites.pkid, 
        zones.origine ,
        zones.origine_id ,
        FALSE,
        zones.category,
        zones.value
        FROM 
        (
            SELECT pkid, geom
            FROM sites
            WHERE status = 'selected'
        ) AS sites
        JOIN polygones_zones AS zones 
             ON ST_Contains(zones.geom , sites.geom)

;