The data access module was introduced with ArcGIS version 10.1. ESRI describes the data access module as follows (source):
The data access module, arcpy.da, is a Python module for working with
data. It allows control of the edit session, edit operation, improved
cursor support (including faster performance), functions for
converting tables and feature classes to and from NumPy arrays, and
support for versioning, replicas, domains, and subtypes workflows.
However, there is very little information regarding why cursor performance is so improved over the previous generation of cursors.
The attached figure shows the results of a benchmark test on the new da
method UpdateCursor versus the old UpdateCursor method. Essentially, the script performs the following workflow:
- Create random points (10, 100, 1000, 10000, 100000)
- Randomly sample from a normal distribution and add value to a new
column in the random points attribute table with a cursor - Run 5 iterations of each random point scenario for both the new and old UpdateCursor methods and write the mean value to lists
- Plot the results
What is going on behind the scenes with the da
update cursor to improve the cursor performance to the degree shown in the figure?
import arcpy, os, numpy, time
arcpy.env.overwriteOutput = True
outws = r'C:\temp'
fc = os.path.join(outws, 'randomPoints.shp')
iterations = [10, 100, 1000, 10000, 100000]
old = []
new = []
meanOld = []
meanNew = []
for x in iterations:
arcpy.CreateRandomPoints_management(outws, 'randomPoints', '', '', x)
arcpy.AddField_management(fc, 'randFloat', 'FLOAT')
for y in range(5):
# Old method ArcGIS 10.0 and earlier
start = time.clock()
rows = arcpy.UpdateCursor(fc)
for row in rows:
# generate random float from normal distribution
s = float(numpy.random.normal(100, 10, 1))
row.randFloat = s
rows.updateRow(row)
del row, rows
end = time.clock()
total = end - start
old.append(total)
del start, end, total
# New method 10.1 and later
start = time.clock()
with arcpy.da.UpdateCursor(fc, ['randFloat']) as cursor:
for row in cursor:
# generate random float from normal distribution
s = float(numpy.random.normal(100, 10, 1))
row[0] = s
cursor.updateRow(row)
end = time.clock()
total = end - start
new.append(total)
del start, end, total
meanOld.append(round(numpy.mean(old),4))
meanNew.append(round(numpy.mean(new),4))
#######################
# plot the results
import matplotlib.pyplot as plt
plt.plot(iterations, meanNew, label = 'New (da)')
plt.plot(iterations, meanOld, label = 'Old')
plt.title('arcpy.da.UpdateCursor -vs- arcpy.UpdateCursor')
plt.xlabel('Random Points')
plt.ylabel('Time (minutes)')
plt.legend(loc = 2)
plt.show()
Best Answer
One of the developers of
arcpy.da
here. We got the performance where it is because performance was our primary concern: the main gripe with the old cursors were that they were slow, not that they lacked any particular functionality. The code uses the same underlying ArcObjects available in ArcGIS since 8.x (the CPython implementation of the search cursor, for example, looks a lot like code samples like this in its implementation except, you know, in C++ instead of C#).The main two things we did to get the speedup are thus:
IDispatch
stuff adds a lot of complexity and slowdown to function calls.Row
object and the really strangewhile cursor.Next():
dance were just plain inefficient in Python. Fetching an item from a list is a very fast operation, and simplifies down to just a couple of CPython function calls (basically a__getitem__
call, heavily optimized on lists). Doingrow.getValue("column")
by comparison is more heavyweight: it does a__getattr__
to fetch the method (on which it needs to create a new bound method object), then call that method with the given arguments (__call__
). Each part of thearcpy.da
implementation is very closely integrated with the CPython API with a lot of hand-tuned C++ to make it fast, using native Python data structures (and numpy integration, too, for even more speed and memory efficiency).You'll also notice that in nearly any benchmark (see these slides for example), arcobjects in .Net and C++ are still over twice as fast as
arcpy.da
in most tasks. Python code usingarcpy.da
is faster, but still not faster than a compiled, lower-level language.TL;DR:
da
is faster becauseda
is implemented in straight-up, unadulterated Arcobjects/C++/CPython which was specifically designed to result in fast Python code.