PostGIS – Achieving ArcGIS-like Speed with Spatial Index

algorithmpostgisspatial-indexsql

I have been using Postgis 2.0 for 3/4 of a year now and while I really enjoy using it, excessive query processing time has rendered it basically unusable for my use case.

I tend to do heavy geoprocessing on municipal datasets which often have hundreds of thousands of multipolygons. These multipolygons are sometimes shaped very irregularly and can vary from 4 points to 78,000 points per multipolygon.

For example, when I intersect a parcel dataset with 329,152 multipolygons with a jurisdiction dataset containing 525 multipolygons, I get the following stats for total time consumed:

ArcGIS 10.0 (on same host with windows 7 OS): 3 minutes
Postgis:56 minutes (not including geometry pre-processing queries)

In other words, it requires 1500% more time to do this intersection in Postgis than in ArcGIS –and this is one of my more simple queries!

One of the reasons ArcGIS supposedly runs faster is due to better indexes. Some programmers recently figured out how these indexes work and I am wondering if anyone knows how to build these indexes in Postgis (or build tables that would mimic the indexes). Perhaps this would solve most of the speed issues in Postgis. I can only hope there must be some way, especially since ArcGIS can only use 4 GB of RAM while I could use up to 4 times that for my postgis server!

Of course there are many reasons postgis can run slowly, so I will provide a detailed version of my system specs:

Machine: Dell XPS 8300 
Processor: i7-2600 CPU @ 3.40 GHz 3.40 GHz 
Memory: Total Memory 16.0 GB (10.0 GB on virtual machine)

Platform: Ubuntu Server 12.04 Virtual Box VM

Potgres Version: 9.1.4
Postgis Version: POSTGIS="2.0.1 r9979" GEOS="3.3.5-CAPI-1.7.5" PROJ="Rel. 4.8.0, 6 March 2012" GDAL="GDAL 1.9.1, released 2012/05/15" LIBXML="2.7.8" LIBJSON="UNKNOWN" TOPOLOGY RASTER

I also detail the entire installation process that I used to set up postgis including the creation of the VM itself.

I also increased shared memory from the default 24MB to 6 GB in the conf file and ran the following commands to allow postgres to run:

sudo sysctl -w kernel.shmmax=7516192768 (I know this setting is deleted every time you restart the OS)
sudo /etc/init.d/postgresql restart

As far as I can tell this does absolutely nothing noticeable in terms of performance.

Here are links to the data I used for this test:

Parcels: tcad_parcels_06142012.shp.zip from City of Austin, TX
Jurisdictions: Jurisdictional Boundaries from City of Austin, TX

Here are the steps I took to process the data:

ArcGIS

Add datasets to ArcMap
Set coordinate system to central texas feet (srid 2277)
Use intersection tool from dropdown menu

Postgis

Import parcels using:

shp2pgsql -c -s 2277 -D -i -I -W UTF-8 "tcad_parcels_06142012.shp" "public"."tcad_parcels_06142012" |psql -d postgis_testing -U postgres -h local_ip -p 5432

Import jurisdictions using:

shp2pgsql -c -s 2277 -D -i -I -W UTF-8 "jurisdictions.shp" "public"."jurisdictions" |psql -d postgis_testing -U postgres -h local_ip -p 5432

Clean invalid geometry in parcels:

DROP TABLE IF EXISTS valid_parcels;
CREATE TABLE valid_parcels(
  gid serial PRIMARY KEY,
  orig_gid integer,
  geom geometry(multipolygon,2277)
);
CREATE INDEX ON valid_parcels USING gist (geom);
INSERT INTO valid_parcels(orig_gid,geom)
  SELECT 
    gid 
    orig_gid,
    st_multi(st_makevalid(geom)) 
  FROM 
    tcad_parcels_06142012;
CLUSTER valid_parcels USING valid_parcels_geom_idx;

Clean invalid geometry in jurisdictions:

DROP TABLE IF EXISTS valid_jurisdictions;
CREATE TABLE valid_jurisdictions(
  gid serial PRIMARY KEY,
  orig_gid integer,
  geom geometry(multipolygon,2277)
);
CREATE INDEX ON valid_jurisdictions USING gist (geom);
INSERT INTO valid_jurisdictions(orig_gid,geom)
  SELECT 
    gid 
    orig_gid,
    st_multi(st_makevalid(geom)) 
  FROM 
    jurisdictions;
CLUSTER valid_jurisdictions USING valid_jurisdictions_geom_idx;

Run cluster:

cluster;

Run vacuum analyze:

vacuum analyze;

Perform intersection on cleaned tables:

CREATE TABLE parcel_jurisdictions(
  gid serial primary key,
  parcel_gid integer,
  jurisdiction_gid integer,
  isect_geom geometry(multipolygon,2277)
);
CREATE INDEX ON parcel_jurisdictions using gist (isect_geom);

INSERT INTO parcel_jurisdictions(parcel_gid,jurisdiction_gid,isect_geom)
  SELECT
    a.orig_gid parcel_gid,
    b.orig_gid jurisdiction_gid,
    st_multi(st_intersection(a.geom,b.geom))
  FROM
    valid_parcels a, valid_jurisdictions b
  WHERE
    st_intersects(a.geom,b.geom);

Explain Analyze intersection query:

Total runtime: 3446860.731 ms
        Index Cond: (geom && b.geom)
  ->  Index Scan using valid_parcels_geom_idx on valid_parcels a  (cost=0.00..11.66 rows=2 width=1592) (actual time=0.030..4.596 rows=1366 loops=525)
  ->  Seq Scan on valid_jurisdictions b  (cost=0.00..113.25 rows=525 width=22621) (actual time=0.009..0.755 rows=525 loops=1)
Nested Loop  (cost=0.00..61428.74 rows=217501 width=24213) (actual time=2.625..3445946.889 rows=329152 loops=1)
  Join Filter: _st_intersects(a.geom, b.geom)

From everything I have read, my intersection query is efficient and I have absolutely no idea what I am doing wrong for the query to take 56 minutes on clean geometry!

Best Answer

Different approach. Knowing that the pain is in ST_Intersection, and that true/false tests are fast, trying to minimize the amount of geometry passing through the intersection might speed things up. For example, parcels that are totally contained in a jurisdiction don't need to be clipped, but ST_Intersection will still probably go to the trouble of building part of the intersection overlay before realizing it doesn't have to generate any new geometry. So this

INSERT INTO parcel_jurisdictions(parcel_gid,jurisdiction_gid,isect_geom)
SELECT
  a.orig_gid AS parcel_gid,
  b.orig_gid AS jurisdiction_gid,

  st_multi(st_intersection(a.geom,b.geom)) AS geom
FROM
  valid_parcels a, valid_jurisdictions b
WHERE
  st_intersects(a.geom, b.geom) and not st_within(a.geom, b.geom)
UNION ALL
SELECT
  a.orig_gid AS parcel_gid,
  b.orig_gid AS jurisdiction_gid,
  a.geom AS geom
FROM
  valid_parcels a, valid_jurisdictions b
WHERE
  st_within(a.geom, b.geom);

Or even terser

INSERT INTO parcel_jurisdictions(parcel_gid,jurisdiction_gid,isect_geom)
SELECT
  a.orig_gid AS parcel_gid,
  b.orig_gid AS jurisdiction_gid,
  CASE 
     WHEN ST_Within(a.geom,b.geom) 
     THEN a.geom
     ELSE ST_Multi(ST_Intersection(a.geom,b.geom)) 
  END AS geom
FROM valid_parcels a
JOIN valid_jurisdictions b
ON ST_Intersects(a.geom, b.geom)

Might even be faster w/o the UNION.

Related Solutions

[GIS] Importing MultiPolygon into PostGIS

Your MultiPolygon is invalid. You can test it with ogrinfo:

ogrinfo -ro -dialect sqlite -sql "select IsValid(geometry) from
OGRGeoJSON" multipolygon.json
INFO: Open of `multipolygon.json'
      using driver `GeoJSON' successful.
GEOS warning: Self-intersection at or near point -115.907104 39.162109000000001

Layer name: SELECT
Geometry: None
Feature Count: 1
Layer SRS WKT:
(unknown)
isvalid(geometry): Integer (0.0)
OGRFeature(SELECT):0
  isvalid(geometry) (Integer) = 0

You can try to correct it in PostGIS with ST_MakeValid http://postgis.net/docs/ST_MakeValid.html. However, it is not simple

Step 1 Correct the faulty geometry:

create table valid as select st_makevalid(geometry) as geometry from multipolygon;

Step 2 Test:

select * from valid where ST_Intersects(ST_PointFromText('POINT( -116.024551 38.485773 )',4326), geometry);

Result: error

ERROR:  Relate Operation called with a LWGEOMCOLLECTION type.  This is unsupported.
HINT:  Change argument 2: 'GEOMETRYCOLLECTION(POLYGON((-115.000793 38.677307,-115.000793 38.499878,-115....'

Pity that ST_Intersects does not handle geometrycollections. So you must get rid of the collection. First part of the collection is the polygon so you can take just that.

Step 3 Keep just the polygon part

update valid set geometry=ST_GeometryN(geometry,1);

Step 4 Test again

select * from valid where ST_Intersects(ST_PointFromText('POINT( -116.024551 38.485773 )',4326), geometry);

Result: Success!

[GIS] How to use St_intersects with different geometry type

Fast query results for ST_Intersects hinge on the fact that not every pair of inputs needs to be tested. PostGIS avoids testing every pair of geometries by implicitly testing the arguments to ST_Intersects with the bounding box intersection operator &&, so that only geometries whose bounding boxes intersect need to be passed to ST_Intersects. When your geometry columns are indexed, PostgreSQL can use the index to fetch only geometries that pass the && filter, significantly reducing the number of comparisons.

Here's the problem. The index provides the bounding boxes of a.geom and b.geom, but not ST_Centroid(a.geom). You and I know that whenever a.geom && b.geom is true, then ST_Centroid(a.geom) && b.geom must also be true, but PostgreSQL has no way to know this.

You can fix this by manually forcing an a.geom && b.geom comparison, which can take advantage of the index.

SELECT a.id, b.id 
FROM PolygonLayer1 a, PolygonLayer2 b
WHERE a.geom && b.geom AND ST_Intersects(ST_Centroid(a.geom), b.geom)

This doesn't explain why you're getting good performance in Case 1, because I have no idea.

Best Answer

Related Solutions

[GIS] Importing MultiPolygon into PostGIS

[GIS] How to use St_intersects with different geometry type

Related Question