[GIS] How to utilize NumPy arrays to optimize big data geoprocessing

arcgis-desktop, arcpy, big-data, numpy, python

I'm interested in learning how to use NumPy arrays to optimize geoprocessing. Much of my work involves "big data", where geoprocessing can take days to accomplish certain tasks. Needless to say, I am very interested in optimizing these routines. ArcGIS 10.1 has a number of NumPy functions that can be accessed via arcpy, including the three below (see the sketch after this list):

  1. NumPyArrayToFeatureClass (arcpy.da)
  2. RasterToNumPyArray (arcpy)
  3. TableToNumPyArray (arcpy.da)
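
For reference, a minimal sketch of how these three calls look; the paths, field names, and coordinates below are hypothetical:

```python
import arcpy
import numpy as np

# Hypothetical inputs -- substitute your own datasets and fields.
points_fc = r"C:\data\project.gdb\sample_points"
dem = r"C:\data\project.gdb\elevation"

# TableToNumPyArray: pull attribute fields into a structured array.
attrs = arcpy.da.TableToNumPyArray(points_fc, ["OID@", "VALUE"])

# RasterToNumPyArray: read a raster into a 2D ndarray.
grid = arcpy.RasterToNumPyArray(dem, nodata_to_value=-9999)

# NumPyArrayToFeatureClass: write a structured array with a coordinate
# field back out as a point feature class.
pts = np.array([(1, (471316.38, 5000448.78))],
               dtype=[("idfield", np.int32), ("XY", "<f8", 2)])
arcpy.da.NumPyArrayToFeatureClass(pts, r"C:\data\project.gdb\points_out", ["XY"])
```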

For example purposes, let's say I want to optimize the following processing-intensive workflow using NumPy arrays:

[Workflow diagram: a large set of vector points passes through vector and raster operations, ending in a binary integer raster]

The general idea here is that a huge number of vector points move through both vector- and raster-based operations, resulting in a binary integer raster dataset.

How could I incorporate NumPy arrays to optimize this type of workflow?

Best Answer

I think the crux of the question here is: which tasks in your workflow are not really ArcGIS-dependent? Obvious candidates include tabular and raster operations. If the data must start and end in a geodatabase or some other Esri format, then you need to figure out how to minimize the cost of that reformatting (i.e., minimize the number of round trips), or justify it in the first place; it may simply be too expensive to rationalize. Another tactic is to modify your workflow to reach python-friendly data models earlier (for instance, how soon could you ditch vector polygons?).
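
As a rough sketch of that "one round trip" idea (all paths, fields, and the simple gridding step here are hypothetical): read the feature class into NumPy once, do every intermediate step on plain ndarrays, and write the binary raster back once at the end.

```python
import arcpy
import numpy as np

# Hypothetical paths/fields -- one read in, all work in NumPy, one write out.
points_fc = r"C:\data\project.gdb\sample_points"
out_raster = r"C:\data\project.gdb\binary_result"

# Single trip out of the geodatabase: coordinates plus the attribute of interest.
arr = arcpy.da.FeatureClassToNumPyArray(points_fc, ["SHAPE@X", "SHAPE@Y", "VALUE"])

# ...all intermediate steps happen on plain ndarrays, with no gdb I/O...
cell_size = 30.0
x_min, y_min = arr["SHAPE@X"].min(), arr["SHAPE@Y"].min()
cols = ((arr["SHAPE@X"] - x_min) // cell_size).astype(int)
rows = ((arr["SHAPE@Y"] - y_min) // cell_size).astype(int)

grid = np.zeros((rows.max() + 1, cols.max() + 1), dtype=np.int32)
# NumPy row 0 is the top of the raster, so flip the row index.
grid[rows.max() - rows, cols] = (arr["VALUE"] > 0).astype(np.int32)

# Single trip back into the geodatabase: save the binary grid as a raster.
ras = arcpy.NumPyArrayToRaster(grid, arcpy.Point(x_min, y_min), cell_size, cell_size)
ras.save(out_raster)
```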

To echo @gene: while numpy/scipy are really great, don't assume they are the only approaches available. You can also use lists, sets, and dictionaries as alternative structures (although @blah238's link is pretty clear about the efficiency differences), and there are also generators, iterators, and all kinds of other great, fast, efficient tools for working with these structures in Python. Raymond Hettinger, one of the Python core developers, has all kinds of great general Python content out there. This video is a nice example.
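
To illustrate with a toy example (the feature class and fields are hypothetical): a generator expression over a da.SearchCursor streams rows without ever holding the whole table in memory, and a dictionary built once gives constant-time lookups afterwards, instead of re-scanning a table or running a join per feature.

```python
import arcpy

points_fc = r"C:\data\project.gdb\sample_points"  # hypothetical

# Generator expression: rows stream through lazily, so even a huge
# feature class never has to sit in memory all at once.
with arcpy.da.SearchCursor(points_fc, ["OID@", "VALUE"]) as cursor:
    total = sum(value for _, value in cursor if value is not None)

# Dictionary: build once, then every per-feature lookup is O(1).
with arcpy.da.SearchCursor(points_fc, ["OID@", "VALUE"]) as cursor:
    value_by_oid = {oid: value for oid, value in cursor}
```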

Also, to add to @blah238's point about multiplexed processing: if you're writing/executing within IPython (not just the "regular" Python environment), you can use its "parallel" package to exploit multiple cores. I'm no whiz with this stuff, but I find it a bit higher-level and more newbie-friendly than the multiprocessing module. Probably just a matter of personal preference there, so take that with a grain of salt. There's a good overview of it starting at 2:13:00 in this video, and the whole video is great for IPython in general.
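
For comparison, here is a bare-bones multiprocessing sketch (the worker function and data are made up for illustration); the IPython parallel machinery wraps a similar map-style pattern at a higher level:

```python
import multiprocessing
import numpy as np

def classify_chunk(chunk):
    # Hypothetical CPU-bound step: threshold a block of values to 0/1.
    return (chunk > 0.5).astype(np.int32)

if __name__ == "__main__":
    data = np.random.rand(10000000)
    chunks = np.array_split(data, multiprocessing.cpu_count())

    # Each core classifies one chunk; results come back in the original order.
    pool = multiprocessing.Pool()
    results = pool.map(classify_chunk, chunks)
    pool.close()
    pool.join()

    binary = np.concatenate(results)
```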
