I'm trying to speed up my PostGIS queries using multiprocessing. My current setup uses Python and psycopg2, as shown below. Although this has given some speed-up, there still seem to be bottlenecks preventing further gains and I'm not sure where to look next.
I've increased a lot of the Postgres parameters as suggested in 'Performance Tuning Postgres', but when I run this on AWS I never come anywhere near maxing out RAM or I/O, which are supposedly the limiting factors for database activity. Can anyone suggest other ways of speeding this up?
> import os, sys, psycopg2, multiprocessing, time
>
> start = time.time()
>
> def getOidRanges(rownums, count):
>     # Build the list of oid entries to farm out to the workers
>     oidranges = []
>     for row in rownums:
>         minoid = int(row[0])
>         oidranges.append((minoid,))
>     return oidranges
>
> def mp(rownums, whereclause):
>     # Each worker must open its OWN connection: a psycopg2
>     # connection cannot be shared across forked processes
>     conn = psycopg2.connect("dbname=template_postgis_20 user=postgres")
>     cur = conn.cursor()
>     for row in rownums:
>         if row[0] == whereclause:
>             gid1 = int(row[0])
>             # Parameterised query instead of string .format()
>             cur.execute("""
>                 UPDATE sites SET postcode =
>                     (SELECT field62 FROM
>                         (SELECT field62, COUNT(field62)
>                          FROM addressbaseplusbh1_2
>                          WHERE ST_Within(addressbaseplusbh1_2.geom,
>                                (SELECT geom FROM sites WHERE gid = %s))
>                          GROUP BY field62
>                          ORDER BY count DESC) AS postcode
>                      LIMIT 1)
>                 WHERE gid = %s;""", (gid1, gid1))
>             conn.commit()
>     cur.close()
>     conn.close()
>
> if __name__ == "__main__":
>     conn = psycopg2.connect("dbname=template_postgis_20 user=postgres")
>     cur = conn.cursor()
>
>     cur.execute("""SELECT count(*) FROM sites""")
>     count = cur.fetchall()
>
>     cur.execute("""SELECT gid FROM sites ORDER BY gid;""")
>     rownums = cur.fetchall()
>
>     cores = multiprocessing.cpu_count() - 1
>     procfeaturelimit = 1
>
>     oidranges = getOidRanges(rownums, count)
>
>     if len(oidranges) > 0:
>         pool = multiprocessing.Pool(cores)
>         # Collect every async result; a single variable would be
>         # overwritten on each loop iteration
>         jobs = [pool.apply_async(mp, (rownums, oidrange[0]))
>                 for oidrange in oidranges]
>         pool.close()
>         pool.join()
>         for job in jobs:
>             job.get()   # re-raises any exception from the worker
>
>     cur.close()
>     conn.close()
>     end = time.time()
>     print(end - start)
EDIT:
@Craig, would it work then just to have this as the executed block?
    curs.execute("""
        UPDATE sites SET postcode =
            (SELECT field62 FROM
                (SELECT field62, COUNT(field62)
                 FROM addressbaseplusbh1_2
                 WHERE ST_Within(addressbaseplusbh1_2.geom,
                       (SELECT geom FROM sites WHERE gid = %s))
                 GROUP BY field62
                 ORDER BY count DESC) AS postcode
             LIMIT 1)
        WHERE gid = %s;""", (whereclause, whereclause))
    return
Best Answer
You can use dblink from a plain Postgres query to split the work across separate database connections and execute the chunks simultaneously — effectively single-server parallelism in Postgres. The same approach could be mimicked from Python, but I haven't tried it.
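As a rough sketch of the dblink pattern (the `sites_tagged` table, the `tag_chunk()` helper and the connection string below are placeholders, not anything from the question): open one dblink connection per chunk, fire every chunk's query asynchronously with `dblink_send_query`, then block on each with `dblink_get_result`. This snippet only builds the SQL strings; it never touches a database:

```python
# Sketch: generate the dblink statements that run id-range chunks
# concurrently from a single Postgres session. Table names, the
# tag_chunk() helper and the connstr are illustrative placeholders.

def dblink_batch(connstr, chunks, sql_template):
    """Return (setup, fire, collect) SQL lists for the given id chunks."""
    setup, fire, collect = [], [], []
    for i, (lo, hi) in enumerate(chunks):
        name = "worker_%d" % i
        setup.append("SELECT dblink_connect('%s', '%s');" % (name, connstr))
        # dblink_send_query returns immediately, so all chunks run at once.
        # Real code should escape quotes in the inner SQL (e.g. quote_literal).
        fire.append("SELECT dblink_send_query('%s', '%s');"
                    % (name, sql_template.format(lo=lo, hi=hi)))
        # dblink_get_result blocks until that connection's query finishes
        collect.append("SELECT * FROM dblink_get_result('%s') AS t(ok text);"
                       % name)
    return setup, fire, collect

template = "INSERT INTO sites_tagged SELECT * FROM tag_chunk({lo}, {hi})"
setup, fire, collect = dblink_batch("dbname=mydb",
                                    [(1, 500), (501, 1000)], template)
```

Each `setup`/`fire` pair is issued up front; the `collect` statements are then run one after another to wait for all chunks to finish.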
There are some limitations: 1) the operation needs to be an INSERT, not an UPDATE. Inserts are generally faster anyway, since you're not rewriting rows in an existing table (how much faster depends on your disk, as far as I understand); 2) you need an integer ID field to split the query into chunks. Adding a serial column is best, as it gives you a sequential integer that breaks the work up as evenly as possible.
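The range-splitting itself is simple. A minimal sketch, assuming a serial id column and a hypothetical `tag_postcode()` helper (neither is from the question) — each generated statement would then be handed to one worker connection:

```python
# Sketch: split a serial id range [min_id, max_id] into nchunks roughly
# equal INSERT...SELECT statements. tag_postcode() and sites_tagged are
# hypothetical names for illustration only.

def chunk_statements(min_id, max_id, nchunks, sql_template):
    """Fill sql_template with {lo}/{hi} bounds for each id chunk."""
    span = max_id - min_id + 1
    step = -(-span // nchunks)  # ceiling division
    stmts = []
    for lo in range(min_id, max_id + 1, step):
        hi = min(lo + step - 1, max_id)
        stmts.append(sql_template.format(lo=lo, hi=hi))
    return stmts

template = ("INSERT INTO sites_tagged "
            "SELECT s.*, tag_postcode(s.gid) "
            "FROM sites s WHERE s.gid BETWEEN {lo} AND {hi};")

stmts = chunk_statements(1, 1000, 4, template)
# Each worker then opens its own connection and executes one statement:
#   conn = psycopg2.connect(dsn)
#   conn.cursor().execute(stmt); conn.commit()
```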
See Mike Gleason's parallel processing function for the details.
Key performance tip: use the boundary table as the table to split, not the points.
Using this method, we can boundary-tag ~10 million points against ~15,000 polygons in about a minute on a 16-core Windows Server 2012 machine with 128 GB of RAM and an SSD. It might run faster on Linux, but I haven't tested it.