[GIS] How to write to a single table via Python multiprocessing

Tags: arcpy, parallel-processing, python

A comment in the post "Can multiprocessing with arcpy be run in a script tool?" got me thinking, as I often need to do exactly this:

Just beware of deadlocking situations (two Insert cursors in the same
table for instance)

My question is, how can you write to a single table when using multiprocessing?

Here's an example script that iterates through the sample City layer and uses multiprocessing to copy each city's values to an output table via an InsertCursor (this simplifies the more complicated scenario I have in mind).

# Testing how to write to an output fGDB table from multiple processes
import os, sys, arcpy, multiprocessing
from multiprocessing import Process, Queue

def Worker(input, output):
    # Pull tasks off the input queue until the 'STOP' sentinel arrives
    # (the output queue, done_queue, is unused in this example)
    for inputs in iter(input.get, 'STOP'):
        doProcess(inputs)

def doProcess(inputs):
    outTblName = inputs[0]
    city = inputs[1]
    pop = inputs[2]
    # Every call opens its own InsertCursor on the same shared table
    with arcpy.da.InsertCursor(os.path.join(arcpy.env.scratchGDB, outTblName), ["NAME", "POPULATION"]) as iCursor:
        try:
            iCursor.insertRow([city, pop])
        except:
            # Retry recursively on failure (this is where rows still get lost)
            print("Problem inserting " + city + " : " + str(pop) + " : trying again")
            doProcess(inputs)

if __name__ == '__main__':
    NUMBER_OF_PROCESSES = 8
    task_queue = Queue()
    done_queue = Queue()

    inFC = r"C:\Program Files (x86)\ArcGIS\Desktop10.2\TemplateData\TemplateData.gdb\World\City"
    outTblName = "testTable"

    #Create the empty table
    outTable = os.path.join(arcpy.env.scratchGDB, outTblName)
    if arcpy.Exists(outTable):
        arcpy.Delete_management(outTable)
    arcpy.CreateTable_management(arcpy.env.scratchGDB, outTblName)
    arcpy.AddField_management(outTable, "Name", "TEXT")
    arcpy.AddField_management(outTable, "Population", "DOUBLE")

    #Iterate through the cities. Send each one to the multiprocessor
    with arcpy.da.SearchCursor(inFC, ["NAME", "POPULATION"]) as sCursor:
        for city in sCursor:
            cityName = city[0]
            pop = city[1]
            task_queue.put([outTblName, cityName, pop])

    # Start the worker processes
    for i in range(NUMBER_OF_PROCESSES):
        Process(target=Worker, args=(task_queue, done_queue)).start()

    # One 'STOP' sentinel per worker so that each one shuts down
    for i in range(NUMBER_OF_PROCESSES):
        task_queue.put('STOP')

As expected, it runs into problems when multiple processes try to access the output table simultaneously: even with doProcess calling itself recursively after an error is detected, the output table ends up with fewer rows than the input table.

One idea is for each process to write to its own table and to append them all at the end, as sketched below. Are there any best-practice suggestions?
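For illustration, here is a minimal sketch of that per-process-table idea (hypothetical and untested; the worker payloads are made up, and the final combine uses Merge_management). One caveat: creating tables concurrently inside a single file GDB can hit the same schema-locking problem, so in practice each worker may need its own scratch workspace.

import os, multiprocessing, arcpy

def worker(args):
    workerId, rows = args                        # rows: list of (name, population)
    tblName = "testTable_%d" % workerId          # a private table per worker
    arcpy.CreateTable_management(arcpy.env.scratchGDB, tblName)
    tblPath = os.path.join(arcpy.env.scratchGDB, tblName)
    arcpy.AddField_management(tblPath, "Name", "TEXT")
    arcpy.AddField_management(tblPath, "Population", "DOUBLE")
    # No contention: only this worker ever writes to its own table
    with arcpy.da.InsertCursor(tblPath, ["NAME", "POPULATION"]) as cur:
        for row in rows:
            cur.insertRow(row)
    return tblPath

if __name__ == "__main__":
    # In the real script these chunks would come from the SearchCursor
    chunks = [(0, [("CityA", 1000.0)]), (1, [("CityB", 2000.0)])]
    pool = multiprocessing.Pool(len(chunks))
    partTables = pool.map(worker, chunks)        # each worker returns its table's path
    pool.close()
    pool.join()
    # Single-process merge into the final output table
    arcpy.Merge_management(partTables, os.path.join(arcpy.env.scratchGDB, "testTable"))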

Best Answer

I'd never tried multiprocessing, so I decided to give it a go. This script:

import os, sys, arcpy, multiprocessing
from arcpy import env
env.overwriteOutput = True
scratchGDB = r'd:\rubbish\TEST.gdb'

def function(inputs):
    print("got arg %s" % inputs)
    outTblName = inputs[0]
    city = inputs[1]
    pop = inputs[2]
    # Each worker process opens its own InsertCursor on the shared table
    with arcpy.da.InsertCursor(os.path.join(scratchGDB, outTblName), ["NAME", "POPULATION"]) as iCursor:
        try:
            iCursor.insertRow([city, pop])
        except:
            print("Problem inserting " + city + " : " + str(pop))

if __name__ == "__main__":
    number_of_cpus = 5
    outTblName = "testTable"
    outTable = os.path.join(scratchGDB, outTblName)
    if arcpy.Exists(outTable):
        arcpy.Delete_management(outTable)
    arcpy.CreateTable_management(scratchGDB, outTblName)
    arcpy.AddField_management(outTable, "Name", "TEXT")
    arcpy.AddField_management(outTable, "Population", "DOUBLE")

    # Build one small task per worker: table name, a letter, and a number
    bList = []
    for i in range(number_of_cpus):
        bList.append([outTblName, chr(65 + i), i * i])
    pool = multiprocessing.Pool(number_of_cpus)
    # map blocks until every worker has returned
    for i in pool.map(function, bList):
        print("Writing")
    pool.close()
    pool.join()
    rows = arcpy.da.TableToNumPyArray(os.path.join(scratchGDB, outTblName), ["NAME", "POPULATION"])
    print(rows)

Gave me this output:

[screenshot of the printed NAME/POPULATION rows]

It works as expected.
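That said, the safest general pattern (my suggestion, not something tested above) is to avoid concurrent InsertCursors entirely: have the pool workers compute and return their rows, and let the parent process do all the inserting through a single cursor. A minimal sketch, assuming the testTable created above already exists:

import os, multiprocessing, arcpy

def compute(task):
    # Hypothetical worker: does the expensive per-city work and returns a
    # plain tuple, touching no shared dataset at all
    city, pop = task
    return (city, pop * 2)                  # stand-in for real processing

if __name__ == "__main__":
    tasks = [("A", 1.0), ("B", 2.0), ("C", 3.0)]
    pool = multiprocessing.Pool(4)
    results = pool.map(compute, tasks)      # workers only return data
    pool.close()
    pool.join()
    # One process, one cursor: no contention on the table
    outTable = os.path.join(r'd:\rubbish\TEST.gdb', "testTable")
    with arcpy.da.InsertCursor(outTable, ["NAME", "POPULATION"]) as cur:
        for row in results:
            cur.insertRow(row)

This keeps every write single-process while still parallelising the computation, which is usually where the time goes.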