ArcPy – Why Is Data Access Cursor Performance Enhanced in Recent Versions?


The data access module was introduced with ArcGIS version 10.1. ESRI describes the data access module as follows (source):

The data access module, arcpy.da, is a Python module for working with
data. It allows control of the edit session, edit operation, improved
cursor support (including faster performance), functions for
converting tables and feature classes to and from NumPy arrays, and
support for versioning, replicas, domains, and subtypes workflows.
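
For illustration, here is a minimal sketch of the capabilities that description lists; the geodatabase path, feature class, and field name below are hypothetical:

import os
import arcpy
import numpy

# hypothetical file geodatabase containing a point feature class
gdb = r'C:\temp\demo.gdb'
fc = os.path.join(gdb, 'randomPoints')

# convert a feature class to a NumPy structured array
arr = arcpy.da.FeatureClassToNumPyArray(fc, ['OID@', 'randFloat'])
print(numpy.mean(arr['randFloat']))

# control the edit session with the da module's Editor
with arcpy.da.Editor(gdb) as edit:
    with arcpy.da.UpdateCursor(fc, ['randFloat']) as cursor:
        for row in cursor:
            row[0] += 1.0
            cursor.updateRow(row)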

However, there is very little information about why cursor performance is so much better than with the previous generation of cursors.

The attached figure shows the results of a benchmark test comparing the new arcpy.da.UpdateCursor method with the old arcpy.UpdateCursor method. Essentially, the script performs the following workflow:

  1. Create random points (10, 100, 1000, 10000, 100000)
  2. Randomly sample from a normal distribution and write the values to a new field in the random points attribute table with a cursor
  3. Run 5 iterations of each random point scenario for both the new and old UpdateCursor methods and write the mean value to lists
  4. Plot the results

What is going on behind the scenes with the da update cursor to improve the cursor performance to the degree shown in the figure?


[Figure: mean runtime of arcpy.da.UpdateCursor versus arcpy.UpdateCursor across the random point counts]


import arcpy, os, numpy, time
arcpy.env.overwriteOutput = True

outws = r'C:\temp'
fc = os.path.join(outws, 'randomPoints.shp')

iterations = [10, 100, 1000, 10000, 100000]

meanOld = []
meanNew = []

for x in iterations:
    # reset the timing lists so each mean covers only this point count
    old = []
    new = []
    arcpy.CreateRandomPoints_management(outws, 'randomPoints', '', '', x)
    arcpy.AddField_management(fc, 'randFloat', 'FLOAT')

    for y in range(5):

        # Old method ArcGIS 10.0 and earlier
        start = time.clock()

        rows = arcpy.UpdateCursor(fc)

        for row in rows:
            # generate random float from normal distribution
            s = float(numpy.random.normal(100, 10, 1))
            row.randFloat = s
            rows.updateRow(row)

        del row, rows

        end = time.clock()
        total = end - start
        old.append(total)

        del start, end, total

        # New method 10.1 and later
        start = time.clock()

        with arcpy.da.UpdateCursor(fc, ['randFloat']) as cursor:
            for row in cursor:
                # generate random float from normal distribution
                s = float(numpy.random.normal(100, 10, 1))
                row[0] = s
                cursor.updateRow(row)

        end = time.clock()
        total = end - start
        new.append(total)
        del start, end, total
    meanOld.append(round(numpy.mean(old),4))
    meanNew.append(round(numpy.mean(new),4))

#######################
# plot the results

import matplotlib.pyplot as plt
plt.plot(iterations, meanNew, label = 'New (da)')
plt.plot(iterations, meanOld, label = 'Old')
plt.title('arcpy.da.UpdateCursor -vs- arcpy.UpdateCursor')
plt.xlabel('Random Points')
plt.ylabel('Time (seconds)')
plt.legend(loc = 2)
plt.show()

Best Answer

One of the developers of arcpy.da here. We got the performance where it is because performance was our primary concern: the main gripe with the old cursors was that they were slow, not that they lacked any particular functionality. The code uses the same underlying ArcObjects that have been available in ArcGIS since 8.x (the CPython implementation of the search cursor, for example, looks a lot like code samples such as this in its implementation except, you know, in C++ instead of C#).

The two main things we did to get the speedup are these:

  1. Eliminate layers of abstraction: the initial implementation of the Python cursor was based on the old Dispatch/COM-based GPDispatch object, which let you use the same API in any language that could consume COM Dispatch objects. That meant an API that was not particularly well optimized for any single environment, and it also meant a lot of layers of abstraction so that, for example, the COM objects could advertise and resolve methods at runtime. If you remember the time before ArcGIS 9.3, it was possible to write geoprocessing scripts using that same clunky interface in many languages, even Perl and Ruby. The extra paperwork an object needs to do to handle the IDispatch machinery adds a lot of complexity and slowdown to function calls.
  2. Make a tightly integrated, Python-specific C++ library using Pythonic idioms and data structures: the idea of a Row object and the really strange while cursor.Next(): dance were just plain inefficient in Python. Fetching an item from a list is a very fast operation that boils down to just a couple of CPython function calls (basically a __getitem__ call, heavily optimized on lists). Doing row.getValue("column") by comparison is more heavyweight: it does a __getattr__ to fetch the method (for which it needs to create a new bound method object), then calls that method with the given arguments (__call__). Each part of the arcpy.da implementation is very closely integrated with the CPython API, with a lot of hand-tuned C++ to make it fast, using native Python data structures (and numpy integration, too, for even more speed and memory efficiency). A rough pure-Python analogy of this access-pattern difference follows this list.
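
To make the access-pattern point concrete, here is a rough pure-Python analogy (the OldStyleRow class below is a stand-in for illustration, not the real implementation): list indexing is a single optimized __getitem__, while method-based access pays for an attribute lookup plus a call on every read.

import timeit

class OldStyleRow(object):
    # stand-in for the pre-10.1 Row: every read costs an attribute
    # lookup plus a bound-method call
    def __init__(self, value):
        self._value = value

    def getValue(self, column):
        return self._value

old_row = OldStyleRow(42.0)
new_row = [42.0]  # a da cursor row is just a plain Python sequence

print(timeit.timeit(lambda: old_row.getValue('randFloat')))  # slower
print(timeit.timeit(lambda: new_row[0]))                     # faster

On CPython the indexed access typically runs several times faster, and that difference compounds over millions of row-by-row reads.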

You'll also notice that in nearly any benchmark (see these slides, for example), ArcObjects in .NET and C++ are still over twice as fast as arcpy.da in most tasks. Python code using arcpy.da is faster than it was, but still not faster than a compiled, lower-level language.

TL;DR: da is faster because da is implemented in straight-up, unadulterated ArcObjects/C++/CPython that was specifically designed to produce fast Python code.
