[GIS] How to utilize NumPy arrays to optimize big data geoprocessing

arcgis-desktop, arcpy, big-data, numpy, python

I'm interested in learning how to use NumPy arrays to optimize geoprocessing. Much of my work involves "big data", where geoprocessing can take days to accomplish certain tasks. Needless to say, I am very interested in optimizing these routines. ArcGIS 10.1 has a number of NumPy functions that can be accessed via arcpy, including the three below (see the sketch after this list):

  1. NumPyArrayToFeatureClass (arcpy.da)
  2. RasterToNumPyArray (arcpy)
  3. TableToNumPyArray (arcpy.da)
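
For reference, a minimal sketch of how these three calls look; the paths, field names, and coordinates below are hypothetical:

```python
import arcpy
import numpy as np

# Hypothetical inputs -- substitute your own datasets and fields.
points_fc = r"C:\data\project.gdb\sample_points"
dem = r"C:\data\project.gdb\elevation"

# TableToNumPyArray: pull attribute fields into a structured array.
attrs = arcpy.da.TableToNumPyArray(points_fc, ["OID@", "VALUE"])

# RasterToNumPyArray: read a raster into a 2D ndarray.
grid = arcpy.RasterToNumPyArray(dem, nodata_to_value=-9999)

# NumPyArrayToFeatureClass: write a structured array with a coordinate
# field back out as a point feature class.
pts = np.array([(1, (471316.38, 5000448.78))],
               dtype=[("idfield", np.int32), ("XY", "<f8", 2)])
arcpy.da.NumPyArrayToFeatureClass(pts, r"C:\data\project.gdb\points_out", ["XY"])
```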

For example purposes, let's say I want to optimize the following processing-intensive workflow using NumPy arrays:

[Workflow diagram: a large set of vector points passes through vector and raster operations, ending in a binary integer raster]

The general idea here is that a huge number of vector points move through both vector- and raster-based operations, resulting in a binary integer raster dataset.

How could I incorporate NumPy arrays to optimize this type of workflow?

Best Answer

I think the crux of the question here is: which tasks in your workflow are not really ArcGIS-dependent? Obvious candidates include tabular and raster operations. If the data must start and end in a geodatabase or some other Esri format, then you need to figure out how to minimize the cost of that reformatting (i.e., minimize the number of round trips), or justify it in the first place; it may simply be too expensive to rationalize. Another tactic is to modify your workflow to reach python-friendly data models earlier (for instance, how soon could you ditch vector polygons?).
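
As a rough sketch of that "one round trip" idea (all paths, fields, and the simple gridding step here are hypothetical): read the feature class into NumPy once, do every intermediate step on plain ndarrays, and write the binary raster back once at the end.

```python
import arcpy
import numpy as np

# Hypothetical paths/fields -- one read in, all work in NumPy, one write out.
points_fc = r"C:\data\project.gdb\sample_points"
out_raster = r"C:\data\project.gdb\binary_result"

# Single trip out of the geodatabase: coordinates plus the attribute of interest.
arr = arcpy.da.FeatureClassToNumPyArray(points_fc, ["SHAPE@X", "SHAPE@Y", "VALUE"])

# ...all intermediate steps happen on plain ndarrays, with no gdb I/O...
cell_size = 30.0
x_min, y_min = arr["SHAPE@X"].min(), arr["SHAPE@Y"].min()
cols = ((arr["SHAPE@X"] - x_min) // cell_size).astype(int)
rows = ((arr["SHAPE@Y"] - y_min) // cell_size).astype(int)

grid = np.zeros((rows.max() + 1, cols.max() + 1), dtype=np.int32)
# NumPy row 0 is the top of the raster, so flip the row index.
grid[rows.max() - rows, cols] = (arr["VALUE"] > 0).astype(np.int32)

# Single trip back into the geodatabase: save the binary grid as a raster.
ras = arcpy.NumPyArrayToRaster(grid, arcpy.Point(x_min, y_min), cell_size, cell_size)
ras.save(out_raster)
```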

To echo @gene: while numpy/scipy are really great, don't assume they are the only approaches available. You can also use lists, sets, and dictionaries as alternative structures (although @blah238's link is pretty clear about the efficiency differences), and there are also generators, iterators, and all kinds of other great, fast, efficient tools for working with these structures in Python. Raymond Hettinger, one of the Python core developers, has all kinds of great general Python content out there. This video is a nice example.
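
To illustrate with a toy example (the feature class and fields are hypothetical): a generator expression over a da.SearchCursor streams rows without ever holding the whole table in memory, and a dictionary built once gives constant-time lookups afterwards, instead of re-scanning a table or running a join per feature.

```python
import arcpy

points_fc = r"C:\data\project.gdb\sample_points"  # hypothetical

# Generator expression: rows stream through lazily, so even a huge
# feature class never has to sit in memory all at once.
with arcpy.da.SearchCursor(points_fc, ["OID@", "VALUE"]) as cursor:
    total = sum(value for _, value in cursor if value is not None)

# Dictionary: build once, then every per-feature lookup is O(1).
with arcpy.da.SearchCursor(points_fc, ["OID@", "VALUE"]) as cursor:
    value_by_oid = {oid: value for oid, value in cursor}
```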

Also, to add to @blah238's point about multiplexed processing: if you're writing/executing within IPython (not just the "regular" Python environment), you can use its "parallel" package to exploit multiple cores. I'm no whiz with this stuff, but I find it a bit higher-level and more newbie-friendly than the multiprocessing module. Probably just a matter of personal preference there, so take that with a grain of salt. There's a good overview of it starting at 2:13:00 in this video, and the whole video is great for IPython in general.
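
For comparison, here is a bare-bones multiprocessing sketch (the worker function and data are made up for illustration); the IPython parallel machinery wraps a similar map-style pattern at a higher level:

```python
import multiprocessing
import numpy as np

def classify_chunk(chunk):
    # Hypothetical CPU-bound step: threshold a block of values to 0/1.
    return (chunk > 0.5).astype(np.int32)

if __name__ == "__main__":
    data = np.random.rand(10000000)
    chunks = np.array_split(data, multiprocessing.cpu_count())

    # Each core classifies one chunk; results come back in the original order.
    pool = multiprocessing.Pool()
    results = pool.map(classify_chunk, chunks)
    pool.close()
    pool.join()

    binary = np.concatenate(results)
```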
