I'm trying to speed up my PostGIS queries using multiprocessing. My current setup uses Python and psycopg2, as shown below. Although this has given some speed-up, there still seem to be bottlenecks preventing further gains and I'm not sure where to look next.
I've increased a lot of the Postgres parameters as suggested in 'Performance Tuning Postgres', but when I run this on AWS I never come anywhere near maxing out RAM or I/O, which are supposedly the limiting factors for database activity. Can anyone suggest other ways of speeding this up?
> import os, sys, psycopg2, multiprocessing, time
>
> start = time.time()
>
> def getOidRanges(rownums, count):
>     # Build the list of oid entries to farm out to the workers
>     oidranges = []
>     for row in rownums:
>         minoid = int(row[0])
>         oidranges.append((minoid,))
>     return oidranges
>
> def mp(rownums, whereclause):
>     # Each worker must open its OWN connection: a psycopg2
>     # connection cannot be shared across forked processes
>     conn = psycopg2.connect("dbname=template_postgis_20 user=postgres")
>     cur = conn.cursor()
>     for row in rownums:
>         if row[0] == whereclause:
>             gid1 = int(row[0])
>             # Parameterised query instead of string .format()
>             cur.execute("""
>                 UPDATE sites SET postcode =
>                     (SELECT field62 FROM
>                         (SELECT field62, COUNT(field62)
>                          FROM addressbaseplusbh1_2
>                          WHERE ST_Within(addressbaseplusbh1_2.geom,
>                                (SELECT geom FROM sites WHERE gid = %s))
>                          GROUP BY field62
>                          ORDER BY count DESC) AS postcode
>                      LIMIT 1)
>                 WHERE gid = %s;""", (gid1, gid1))
>             conn.commit()
>     cur.close()
>     conn.close()
>
> if __name__ == "__main__":
>     conn = psycopg2.connect("dbname=template_postgis_20 user=postgres")
>     cur = conn.cursor()
>
>     cur.execute("""SELECT count(*) FROM sites""")
>     count = cur.fetchall()
>
>     cur.execute("""SELECT gid FROM sites ORDER BY gid;""")
>     rownums = cur.fetchall()
>
>     cores = multiprocessing.cpu_count() - 1
>     procfeaturelimit = 1
>
>     oidranges = getOidRanges(rownums, count)
>
>     if len(oidranges) > 0:
>         pool = multiprocessing.Pool(cores)
>         # Collect every async result; a single variable would be
>         # overwritten on each loop iteration
>         jobs = [pool.apply_async(mp, (rownums, oidrange[0]))
>                 for oidrange in oidranges]
>         pool.close()
>         pool.join()
>         for job in jobs:
>             job.get()   # re-raises any exception from the worker
>
>     cur.close()
>     conn.close()
>     end = time.time()
>     print(end - start)
EDIT:
@Craig, would it work then just to have this as the executed block?
    curs.execute("""
        UPDATE sites SET postcode =
            (SELECT field62 FROM
                (SELECT field62, COUNT(field62)
                 FROM addressbaseplusbh1_2
                 WHERE ST_Within(addressbaseplusbh1_2.geom,
                       (SELECT geom FROM sites WHERE gid = %s))
                 GROUP BY field62
                 ORDER BY count DESC) AS postcode
             LIMIT 1)
        WHERE gid = %s;""", (whereclause, whereclause))
    return
Best Answer
You can use dblink from a plain Postgres query to split the work across separate database connections and execute the chunks simultaneously — effectively single-server parallelism in Postgres. The same approach could be mimicked from Python, but I haven't tried it.
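As a rough sketch of the dblink pattern (the `sites_tagged` table, the `tag_chunk()` helper and the connection string below are placeholders, not anything from the question): open one dblink connection per chunk, fire every chunk's query asynchronously with `dblink_send_query`, then block on each with `dblink_get_result`. This snippet only builds the SQL strings; it never touches a database:

```python
# Sketch: generate the dblink statements that run id-range chunks
# concurrently from a single Postgres session. Table names, the
# tag_chunk() helper and the connstr are illustrative placeholders.

def dblink_batch(connstr, chunks, sql_template):
    """Return (setup, fire, collect) SQL lists for the given id chunks."""
    setup, fire, collect = [], [], []
    for i, (lo, hi) in enumerate(chunks):
        name = "worker_%d" % i
        setup.append("SELECT dblink_connect('%s', '%s');" % (name, connstr))
        # dblink_send_query returns immediately, so all chunks run at once.
        # Real code should escape quotes in the inner SQL (e.g. quote_literal).
        fire.append("SELECT dblink_send_query('%s', '%s');"
                    % (name, sql_template.format(lo=lo, hi=hi)))
        # dblink_get_result blocks until that connection's query finishes
        collect.append("SELECT * FROM dblink_get_result('%s') AS t(ok text);"
                       % name)
    return setup, fire, collect

template = "INSERT INTO sites_tagged SELECT * FROM tag_chunk({lo}, {hi})"
setup, fire, collect = dblink_batch("dbname=mydb",
                                    [(1, 500), (501, 1000)], template)
```

Each `setup`/`fire` pair is issued up front; the `collect` statements are then run one after another to wait for all chunks to finish.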
There are some limitations: 1) the operation needs to be an INSERT, not an UPDATE. Inserts are generally faster anyway, since you're not rewriting rows in an existing table (how much faster depends on your disk, as far as I understand); 2) you need an integer ID field to split the query into chunks. Adding a serial column is best, as it gives you a sequential integer that breaks the work up as evenly as possible.
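The range-splitting itself is simple. A minimal sketch, assuming a serial id column and a hypothetical `tag_postcode()` helper (neither is from the question) — each generated statement would then be handed to one worker connection:

```python
# Sketch: split a serial id range [min_id, max_id] into nchunks roughly
# equal INSERT...SELECT statements. tag_postcode() and sites_tagged are
# hypothetical names for illustration only.

def chunk_statements(min_id, max_id, nchunks, sql_template):
    """Fill sql_template with {lo}/{hi} bounds for each id chunk."""
    span = max_id - min_id + 1
    step = -(-span // nchunks)  # ceiling division
    stmts = []
    for lo in range(min_id, max_id + 1, step):
        hi = min(lo + step - 1, max_id)
        stmts.append(sql_template.format(lo=lo, hi=hi))
    return stmts

template = ("INSERT INTO sites_tagged "
            "SELECT s.*, tag_postcode(s.gid) "
            "FROM sites s WHERE s.gid BETWEEN {lo} AND {hi};")

stmts = chunk_statements(1, 1000, 4, template)
# Each worker then opens its own connection and executes one statement:
#   conn = psycopg2.connect(dsn)
#   conn.cursor().execute(stmt); conn.commit()
```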
See Mike Gleason's parallel processing function for the details.
Key performance tip: use the boundary table as the table to split, not the points.
Using this method, we can boundary-tag ~10 million points against ~15,000 polygons in about a minute on a 16-core Windows Server 2012 machine with 128 GB of RAM and an SSD. It might run faster on Linux, but I haven't tested it.