[GIS] Python Postgres Multiprocessing speed increase

parallel processing, postgresql, python

I'm trying to speed up my PostGIS queries using multiprocessing. My current setup uses Python and psycopg2, as shown below. Although this has given some speed increase, there still seem to be bottlenecks preventing a bigger gain, and I'm not sure where to go next.

I've increased many of the Postgres parameters as suggested in 'Performance Tuning Postgres', but when I run this on AWS I never seem to come anywhere near maxing out RAM or I/O, which are supposedly the limiting factors for database activity. Can anyone suggest other ways of speeding this up?

> import os, sys, psycopg2, multiprocessing, time 
> start = time.time() 
> conn = psycopg2.connect("dbname=template_postgis_20 user=postgres") 
> cur = conn.cursor() 
> cur.execute("""SELECT count(*) FROM sites""") 
> count = cur.fetchall()
> def getOidRanges(rownums, count):
> 
>     oidranges = []
>     for row in rownums:
>         minoid = int(row[0])
>     return oidranges
> 
> def mp(rownums, whereclause):
> 
>     for row in rownums:
>         if row[0] == whereclause:
>             gid1 = int(row[0])
>             cur.execute("""
>             UPDATE sites SET postcode=(SELECT field62 FROM (SELECT field62, COUNT(field62) FROM addressbaseplusbh1_2
>             WHERE ST_Within(addressbaseplusbh1_2.geom, (select geom from sites where gid={0})) GROUP BY field62 ORDER BY count DESC)
>             as postcode LIMIT 1) WHERE gid = {0};""".format(gid1))
>             conn.commit()
> 
>     return
> 
> if __name__ == "__main__":
> 
>     nbrquery = ("""SELECT gid FROM sites ORDER BY gid;""")
>     cur.execute(nbrquery)
>     rownums=cur.fetchall()
> 
>     cores = (multiprocessing.cpu_count()-1)
>     procfeaturelimit = 1
> 
>     oidranges = getOidRanges(rownums, count)
> 
>     if len(oidranges) > 0:
>         pool = multiprocessing.Pool(cores)
> 
>         for oidrange in oidranges:
> 
>             whereclause = oidrange[0]
>             jobs = pool.apply_async(mp, (rownums, whereclause))
> 
>         pool.close()
>         pool.join()
>         jobs.get()
> 
>         try:
>             conn.commit()
>             cur.close()
>             conn.close()
>             end = time.time()
>             print end - start
>         except:
>             pass

EDIT:

@Craig, would it work then just to have this as the executed block?

cur.execute("""UPDATE sites SET postcode=(SELECT field62 FROM (SELECT field62, COUNT(field62) FROM addressbaseplusbh1_2 WHERE ST_Within(addressbaseplusbh1_2.geom, 
(select geom from sites where gid={0})) GROUP BY field62 ORDER BY count DESC)
as postcode LIMIT 1) WHERE gid = {0};""".format(whereclause))
return

Best Answer

You can use dblink in a native Postgres query to split the work across separate database connections and execute the chunks simultaneously. This is effectively parallelism in Postgres on a single server. It could be mimicked in Python, but I haven't tried it.
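
The "mimicked in Python" route essentially means giving each worker its own connection. A minimal sketch, assuming a separate results table (here called sites_postcodes, which is not part of your schema) and pre-computed gid ranges, might look like this:

import multiprocessing
import psycopg2

DSN = "dbname=template_postgis_20 user=postgres"

def process_chunk(gid_range):
    """Insert postcodes for one range of site gids on a private connection."""
    lo, hi = gid_range
    conn = psycopg2.connect(DSN)  # one connection per worker process
    cur = conn.cursor()
    cur.execute("""
        INSERT INTO sites_postcodes (gid, postcode)
        SELECT s.gid,
               (SELECT a.field62
                FROM addressbaseplusbh1_2 a
                WHERE ST_Within(a.geom, s.geom)
                GROUP BY a.field62
                ORDER BY count(*) DESC
                LIMIT 1)
        FROM sites s
        WHERE s.gid BETWEEN %s AND %s;""", (lo, hi))
    conn.commit()
    cur.close()
    conn.close()

if __name__ == "__main__":
    chunks = [(1, 5000), (5001, 10000), (10001, 15000)]  # example gid ranges
    pool = multiprocessing.Pool(multiprocessing.cpu_count() - 1)
    pool.map(process_chunk, chunks)
    pool.close()
    pool.join()

Each chunk commits independently, so the database sees a handful of long-running inserts instead of one row-at-a-time update per gid.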

There are some limitations: 1) the operation needs to be an insert, not an update. Inserts are generally faster anyway as you're not altering an existing table (depends on your HDD as far as I understand); 2) you'll need an integer ID field to be able to split the query into chunks. Adding a serial field is best as it creates a sequential integer which breaks the work up as evenly as possible.
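
For the second point, a rough sketch (the seq column name is just an example) of adding a serial field and cutting it into evenly sized ranges:

import psycopg2

conn = psycopg2.connect("dbname=template_postgis_20 user=postgres")
cur = conn.cursor()

# A serial column numbers the existing rows consecutively, so equal-width
# ranges of seq contain roughly equal numbers of rows.
cur.execute("ALTER TABLE sites ADD COLUMN seq serial;")
conn.commit()

cur.execute("SELECT min(seq), max(seq) FROM sites;")
lo, hi = cur.fetchone()

nchunks = 8  # e.g. one chunk per worker
step = (hi - lo) // nchunks + 1
chunks = [(lo + i * step, min(lo + (i + 1) * step - 1, hi)) for i in range(nchunks)]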

See Mike Gleason's parallel processing function for the details.

Key performance tip: use the boundary (polygon) table as the table to split, not the points table.
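
In other words, the ID range in each chunk should come from the polygon layer, so each worker handles a slice of the polygons and finds its points through the spatial index. A hypothetical per-chunk statement (the table names here are invented for illustration) would look like:

def boundary_chunk_sql(lo, hi):
    """Build the query for one chunk, splitting on the boundary (polygon) gids."""
    return """
        INSERT INTO tagged_points (point_id, boundary_id)
        SELECT p.gid, b.gid
        FROM points p
        JOIN boundaries b ON ST_Within(p.geom, b.geom)
        WHERE b.gid BETWEEN %d AND %d;""" % (lo, hi)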

Using this method, we can boundary-tag ~10 million points against ~15,000 polygons in about a minute on a 16-core Windows Server 2012 machine with 128 GB of RAM and SSD storage. It could run faster on Linux, but I haven't tested it.
