PostGIS – Achieving ArcGIS-like Speed with Spatial Index

algorithmpostgisspatial-indexsql

I have been using Postgis 2.0 for 3/4 of a year now and while I really enjoy using it, excessive query processing time has rendered it basically unusable for my use case.

I tend to do heavy geoprocessing on municipal datasets which often have hundreds of thousands of multipolygons. These multipolygons are sometimes shaped very irregularly and can vary from 4 points to 78,000 points per multipolygon.

For example, when I intersect a parcel dataset with 329,152 multipolygons with a jurisdiction dataset containing 525 multipolygons, I get the following stats for total time consumed:

ArcGIS 10.0 (on same host with windows 7 OS): 3 minutes
Postgis:56 minutes (not including geometry pre-processing queries)

In other words, it requires 1500% more time to do this intersection in Postgis than in ArcGIS –and this is one of my more simple queries!

One of the reasons ArcGIS supposedly runs faster is due to better indexes. Some programmers recently figured out how these indexes work and I am wondering if anyone knows how to build these indexes in Postgis (or build tables that would mimic the indexes). Perhaps this would solve most of the speed issues in Postgis. I can only hope there must be some way, especially since ArcGIS can only use 4 GB of RAM while I could use up to 4 times that for my postgis server!

Of course there are many reasons postgis can run slowly, so I will provide a detailed version of my system specs:

Machine: Dell XPS 8300 
Processor: i7-2600 CPU @ 3.40 GHz 3.40 GHz 
Memory: Total Memory 16.0 GB (10.0 GB on virtual machine)

Platform: Ubuntu Server 12.04 Virtual Box VM

Potgres Version: 9.1.4
Postgis Version: POSTGIS="2.0.1 r9979" GEOS="3.3.5-CAPI-1.7.5" PROJ="Rel. 4.8.0, 6 March 2012" GDAL="GDAL 1.9.1, released 2012/05/15" LIBXML="2.7.8" LIBJSON="UNKNOWN" TOPOLOGY RASTER

I also detail the entire installation process that I used to set up postgis including the creation of the VM itself.

I also increased shared memory from the default 24MB to 6 GB in the conf file and ran the following commands to allow postgres to run:

sudo sysctl -w kernel.shmmax=7516192768 (I know this setting is deleted every time you restart the OS)
sudo /etc/init.d/postgresql restart

As far as I can tell this does absolutely nothing noticeable in terms of performance.

Here are links to the data I used for this test:

  1. Parcels: tcad_parcels_06142012.shp.zip from City of Austin, TX
  2. Jurisdictions: Jurisdictional Boundaries from City of Austin, TX

Here are the steps I took to process the data:

ArcGIS

  1. Add datasets to ArcMap
  2. Set coordinate system to central texas feet (srid 2277)
  3. Use intersection tool from dropdown menu

Postgis

Import parcels using:

shp2pgsql -c -s 2277 -D -i -I -W UTF-8 "tcad_parcels_06142012.shp" "public"."tcad_parcels_06142012" |psql -d postgis_testing -U postgres -h local_ip -p 5432

Import jurisdictions using:

shp2pgsql -c -s 2277 -D -i -I -W UTF-8 "jurisdictions.shp" "public"."jurisdictions" |psql -d postgis_testing -U postgres -h local_ip -p 5432

Clean invalid geometry in parcels:

DROP TABLE IF EXISTS valid_parcels;
CREATE TABLE valid_parcels(
  gid serial PRIMARY KEY,
  orig_gid integer,
  geom geometry(multipolygon,2277)
);
CREATE INDEX ON valid_parcels USING gist (geom);
INSERT INTO valid_parcels(orig_gid,geom)
  SELECT 
    gid 
    orig_gid,
    st_multi(st_makevalid(geom)) 
  FROM 
    tcad_parcels_06142012;
CLUSTER valid_parcels USING valid_parcels_geom_idx;

Clean invalid geometry in jurisdictions:

DROP TABLE IF EXISTS valid_jurisdictions;
CREATE TABLE valid_jurisdictions(
  gid serial PRIMARY KEY,
  orig_gid integer,
  geom geometry(multipolygon,2277)
);
CREATE INDEX ON valid_jurisdictions USING gist (geom);
INSERT INTO valid_jurisdictions(orig_gid,geom)
  SELECT 
    gid 
    orig_gid,
    st_multi(st_makevalid(geom)) 
  FROM 
    jurisdictions;
CLUSTER valid_jurisdictions USING valid_jurisdictions_geom_idx;

Run cluster:

cluster;

Run vacuum analyze:

vacuum analyze;

Perform intersection on cleaned tables:

CREATE TABLE parcel_jurisdictions(
  gid serial primary key,
  parcel_gid integer,
  jurisdiction_gid integer,
  isect_geom geometry(multipolygon,2277)
);
CREATE INDEX ON parcel_jurisdictions using gist (isect_geom);

INSERT INTO parcel_jurisdictions(parcel_gid,jurisdiction_gid,isect_geom)
  SELECT
    a.orig_gid parcel_gid,
    b.orig_gid jurisdiction_gid,
    st_multi(st_intersection(a.geom,b.geom))
  FROM
    valid_parcels a, valid_jurisdictions b
  WHERE
    st_intersects(a.geom,b.geom);

Explain Analyze intersection query:

Total runtime: 3446860.731 ms
        Index Cond: (geom && b.geom)
  ->  Index Scan using valid_parcels_geom_idx on valid_parcels a  (cost=0.00..11.66 rows=2 width=1592) (actual time=0.030..4.596 rows=1366 loops=525)
  ->  Seq Scan on valid_jurisdictions b  (cost=0.00..113.25 rows=525 width=22621) (actual time=0.009..0.755 rows=525 loops=1)
Nested Loop  (cost=0.00..61428.74 rows=217501 width=24213) (actual time=2.625..3445946.889 rows=329152 loops=1)
  Join Filter: _st_intersects(a.geom, b.geom)

From everything I have read, my intersection query is efficient and I have absolutely no idea what I am doing wrong for the query to take 56 minutes on clean geometry!

Best Answer

Different approach. Knowing that the pain is in ST_Intersection, and that true/false tests are fast, trying to minimize the amount of geometry passing through the intersection might speed things up. For example, parcels that are totally contained in a jurisdiction don't need to be clipped, but ST_Intersection will still probably go to the trouble of building part of the intersection overlay before realizing it doesn't have to generate any new geometry. So this

INSERT INTO parcel_jurisdictions(parcel_gid,jurisdiction_gid,isect_geom)
SELECT
  a.orig_gid AS parcel_gid,
  b.orig_gid AS jurisdiction_gid,

  st_multi(st_intersection(a.geom,b.geom)) AS geom
FROM
  valid_parcels a, valid_jurisdictions b
WHERE
  st_intersects(a.geom, b.geom) and not st_within(a.geom, b.geom)
UNION ALL
SELECT
  a.orig_gid AS parcel_gid,
  b.orig_gid AS jurisdiction_gid,
  a.geom AS geom
FROM
  valid_parcels a, valid_jurisdictions b
WHERE
  st_within(a.geom, b.geom);

Or even terser

INSERT INTO parcel_jurisdictions(parcel_gid,jurisdiction_gid,isect_geom)
SELECT
  a.orig_gid AS parcel_gid,
  b.orig_gid AS jurisdiction_gid,
  CASE 
     WHEN ST_Within(a.geom,b.geom) 
     THEN a.geom
     ELSE ST_Multi(ST_Intersection(a.geom,b.geom)) 
  END AS geom
FROM valid_parcels a
JOIN valid_jurisdictions b
ON ST_Intersects(a.geom, b.geom)

Might even be faster w/o the UNION.