A common requirement in GIS is to apply a processing tool to a number of files or apply a process for a number of features in one file to another file.
Much of these operations are embarrassingly parallel in that the results of the calculations in no way influence any other operation in the loop. Not only that but often the input files are each distinct.
A classic case in point is the tiling out of shape files against files containing polygons to clip them against.
Here is a (tested) classical procedural method to achieve this in a python script for QGIS. (fyi the output of temporary memory files to real files more than halved the time to process my test files)
import processing
import os
input_file="/path/to/input_file.shp"
clip_polygons_file="/path/to/polygon_file.shp"
output_folder="/tmp/test/"
input_layer = QgsVectorLayer(input_file, "input file", "ogr")
QgsMapLayerRegistry.instance().addMapLayer(input_layer)
tile_layer = QgsVectorLayer(clip_polygons_file, "clip_polys", "ogr")
QgsMapLayerRegistry.instance().addMapLayer(tile_layer)
tile_layer_dp=input_layer.dataProvider()
EPSG_code=int(tile_layer_dp.crs().authid().split(":")[1])
tile_no=0
clipping_polygons = tile_layer.getFeatures()
for clipping_polygon in clipping_polygons:
print "Tile no: "+str(tile_no)
tile_no+=1
geom = clipping_polygon.geometry()
clip_layer=QgsVectorLayer("Polygon?crs=epsg:"+str(EPSG_code)+\
"&field=id:integer&index=yes","clip_polygon", "memory")
clip_layer_dp = clip_layer.dataProvider()
clip_layer.startEditing()
clip_layer_feature = QgsFeature()
clip_layer_feature.setGeometry(geom)
(res, outFeats) = clip_layer_dp.addFeatures([clip_layer_feature])
clip_layer.commitChanges()
clip_file = os.path.join(output_folder,"tile_"+str(tile_no)+".shp")
write_error = QgsVectorFileWriter.writeAsVectorFormat(clip_layer, \
clip_file, "system", \
QgsCoordinateReferenceSystem(EPSG_code), "ESRI Shapefile")
QgsMapLayerRegistry.instance().addMapLayer(clip_layer)
output_file = os.path.join(output_folder,str(tile_no)+".shp")
processing.runalg("qgis:clip", input_file, clip_file, output_file)
QgsMapLayerRegistry.instance().removeMapLayer(clip_layer.id())
This would be fine except that my input file is 2GB and the polygon clipping file contains 400+ polygons. The resulting process takes over a week on my quad core machine. All the while three cores are just idling.
The solution I have in my head is to export the process out to script files and run them asynchronously using gnu parallel for example. However it seems a shame to have to drop out of QGIS into an OS specific solution rather than use something native to QGIS python. So my question is:
Can I parallelise embarrassingly parallel geographic operations natively inside python QGIS?
If not, then perhaps someone already has the code to send this sort of work off to asynchronous shell scripts?
Best Answer
If you change your program to read the file name from the command line and split up your input file in smaller chunks, you can do something like this using GNU Parallel:
This will run 1 job per core.
All new computers have multiple cores, but most programs are serial in nature and will therefore not use the multiple cores. However, many tasks are extremely parallelizeable:
GNU Parallel is a general parallelizer and makes is easy to run jobs in parallel on the same machine or on multiple machines you have ssh access to.
If you have 32 different jobs you want to run on 4 CPUs, a straight forward way to parallelize is to run 8 jobs on each CPU:
GNU Parallel instead spawns a new process when one finishes - keeping the CPUs active and thus saving time:
Installation
If GNU Parallel is not packaged for your distribution, you can do a personal installation, which does not require root access. It can be done in 10 seconds by doing this:
For other installation options see http://git.savannah.gnu.org/cgit/parallel.git/tree/README
Learn more
See more examples: http://www.gnu.org/software/parallel/man.html
Watch the intro videos: https://www.youtube.com/playlist?list=PL284C9FF2488BC6D1
Walk through the tutorial: http://www.gnu.org/software/parallel/parallel_tutorial.html
Sign up for the email list to get support: https://lists.gnu.org/mailman/listinfo/parallel