Parallelizing GIS Operations in PyQGIS – Methods

clipparallel processingpyqgis

A common requirement in GIS is to apply a processing tool to a number of files or apply a process for a number of features in one file to another file.

Much of these operations are embarrassingly parallel in that the results of the calculations in no way influence any other operation in the loop. Not only that but often the input files are each distinct.

A classic case in point is the tiling out of shape files against files containing polygons to clip them against.

Here is a (tested) classical procedural method to achieve this in a python script for QGIS. (fyi the output of temporary memory files to real files more than halved the time to process my test files)

import processing
import os
input_file="/path/to/input_file.shp"
clip_polygons_file="/path/to/polygon_file.shp"
output_folder="/tmp/test/"
input_layer = QgsVectorLayer(input_file, "input file", "ogr")
QgsMapLayerRegistry.instance().addMapLayer(input_layer)
tile_layer  = QgsVectorLayer(clip_polygons_file, "clip_polys", "ogr")
QgsMapLayerRegistry.instance().addMapLayer(tile_layer)
tile_layer_dp=input_layer.dataProvider()
EPSG_code=int(tile_layer_dp.crs().authid().split(":")[1])
tile_no=0
clipping_polygons = tile_layer.getFeatures()
for clipping_polygon in clipping_polygons:
    print "Tile no: "+str(tile_no)
    tile_no+=1
    geom = clipping_polygon.geometry()
    clip_layer=QgsVectorLayer("Polygon?crs=epsg:"+str(EPSG_code)+\
    "&field=id:integer&index=yes","clip_polygon", "memory")
    clip_layer_dp = clip_layer.dataProvider()
    clip_layer.startEditing()
    clip_layer_feature = QgsFeature()
    clip_layer_feature.setGeometry(geom)
    (res, outFeats) = clip_layer_dp.addFeatures([clip_layer_feature])
    clip_layer.commitChanges()
    clip_file = os.path.join(output_folder,"tile_"+str(tile_no)+".shp")
    write_error = QgsVectorFileWriter.writeAsVectorFormat(clip_layer, \
    clip_file, "system", \
    QgsCoordinateReferenceSystem(EPSG_code), "ESRI Shapefile")
    QgsMapLayerRegistry.instance().addMapLayer(clip_layer)
    output_file = os.path.join(output_folder,str(tile_no)+".shp")
    processing.runalg("qgis:clip", input_file, clip_file, output_file)
    QgsMapLayerRegistry.instance().removeMapLayer(clip_layer.id())

This would be fine except that my input file is 2GB and the polygon clipping file contains 400+ polygons. The resulting process takes over a week on my quad core machine. All the while three cores are just idling.

The solution I have in my head is to export the process out to script files and run them asynchronously using gnu parallel for example. However it seems a shame to have to drop out of QGIS into an OS specific solution rather than use something native to QGIS python. So my question is:

Can I parallelise embarrassingly parallel geographic operations natively inside python QGIS?

If not, then perhaps someone already has the code to send this sort of work off to asynchronous shell scripts?

Best Answer

If you change your program to read the file name from the command line and split up your input file in smaller chunks, you can do something like this using GNU Parallel:

parallel my_processing.py {} /path/to/polygon_file.shp ::: input_files*.shp

This will run 1 job per core.

All new computers have multiple cores, but most programs are serial in nature and will therefore not use the multiple cores. However, many tasks are extremely parallelizeable:

  • Run the same program on many files
  • Run the same program for every line in a file
  • Run the same program for every block in a file

GNU Parallel is a general parallelizer and makes is easy to run jobs in parallel on the same machine or on multiple machines you have ssh access to.

If you have 32 different jobs you want to run on 4 CPUs, a straight forward way to parallelize is to run 8 jobs on each CPU:

Simple scheduling

GNU Parallel instead spawns a new process when one finishes - keeping the CPUs active and thus saving time:

GNU Parallel scheduling

Installation

If GNU Parallel is not packaged for your distribution, you can do a personal installation, which does not require root access. It can be done in 10 seconds by doing this:

(wget -O - pi.dk/3 || curl pi.dk/3/ || fetch -o - http://pi.dk/3) | bash

For other installation options see http://git.savannah.gnu.org/cgit/parallel.git/tree/README

Learn more

See more examples: http://www.gnu.org/software/parallel/man.html

Watch the intro videos: https://www.youtube.com/playlist?list=PL284C9FF2488BC6D1

Walk through the tutorial: http://www.gnu.org/software/parallel/parallel_tutorial.html

Sign up for the email list to get support: https://lists.gnu.org/mailman/listinfo/parallel

Related Question