[GIS] Using two with statements with arcpy.da.SearchCursor()

arcgis-10.1, arcpy, cursor, performance

I often see code that compares two shapefiles and needs two with statements in the same loop. In my head I always thought both with statements should come first, before looping through the rows of each cursor, like this:

with arcpy.da.SearchCursor(fc2, field4) as sCursor1:
    with arcpy.da.SearchCursor(fc1, field5) as sCursor2:
        for row1 in sCursor1:
            for row2 in sCursor2:
                #do stuff

However, I have seen code that looks like this:

with arcpy.da.SearchCursor(fc2, field4) as sCursor1:
    for row1 in sCursor1:
        with arcpy.da.SearchCursor(fc1, field5) as sCursor2:
            for row2 in sCursor2:
                #do stuff

Purely from looking at this, I would assume that the second version reopens the cursor for every row in sCursor1, and therefore that the first method would be better. I have read http://preshing.com/20110920/the-python-with-statement-by-example/ and it suggests that:

The above with statement will automatically close the file after the nested block of code.

Therefore it looks like I might be correct. However, is there a preferred method of the two, or another way that would be better, such as creating the cursors first and then using del statements at the end?

Some timings on small, medium, and large data would be good if possible.

Best Answer

Logically, they do effectively the same thing: for every row in one dataset, loop through every row in the other dataset (presumably until some criterion is satisfied, then break). Without a break on an if statement, if both datasets are 10,000 rows long, you would have to iterate through 100,000,000 rows.

EDIT:

However, the with/with/for/for approach doesn't work in practice because, as the documentation says, an arcpy.da.SearchCursor:

Returns an iterator of tuples.

And

Search cursors also support with statements to reset iteration and aid in removal of locks.

This means that once you create a cursor you can only iterate through it once, so you have to delete it and re-create it for multiple passes through the same dataset. An iterator raises StopIteration when it reaches the end of the iterable object, and a for loop handles that by simply stopping (http://anandology.com/python-practice-book/iterators.html) (http://pro.arcgis.com/en/pro-app/arcpy/data-access/searchcursor-class.htm).
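
To see that behavior concretely, here is a minimal sketch (fc1, fc2 and the field names are placeholders): the inner cursor is consumed on the first pass of the outer loop, so every later outer row sees zero inner rows.

import arcpy
fc1 = "feature1"
fc2 = "feature2"
with arcpy.da.SearchCursor(fc2, ["FIELD4"]) as sCursor1:
    with arcpy.da.SearchCursor(fc1, ["FIELD5"]) as sCursor2:
        for row1 in sCursor1:
            inner_count = 0
            for row2 in sCursor2:
                inner_count += 1
            # First outer row: inner_count equals fc1's row count.
            # Every later outer row: inner_count is 0 because sCursor2 is exhausted.
            print("{0}: {1}".format(row1[0], inner_count))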

Sometimes a problem will require this type of workflow, but the best way to optimize it is to reduce the number of times you iterate through your data. For example, if you use a dictionary comprehension to build a dictionary of values from one table, you can look up entries in that dictionary while iterating through the other table, so you would only have to iterate through 20,000 rows to accomplish your goal, like so:

import arcpy
fc1 = "feature1"
fc2 = "feature2"
# Build a lookup dictionary from fc1 in a single pass: {ID: FIELD1 value}
data_dict = {row[0]: row[1] for row in arcpy.da.SearchCursor(fc1, ["ID", "FIELD1"])}
# Single pass over fc2; each dictionary lookup is effectively constant time
with arcpy.da.SearchCursor(fc2, ["ID", "FIELD2"]) as cursor:
    for row in cursor:
        relevant_data = data_dict[row[0]]
        #Do something else

If you absolutely have to iterate through one dataset for every row in another dataset, you can use multiprocessing to significantly reduce the time. On an 8-core machine, iterating through a 10,000-row dataset for every row in a 10,000-row dataset, you would accomplish the same amount of work as if you were iterating through only 12.5 million rows instead of 100 million. For example:

import arcpy
import multiprocessing
from functools import partial

def working_function(fc1, row):
    # 'row' is one record from fc2; open a fresh cursor on fc1 for each one
    fc2row = row
    with arcpy.da.SearchCursor(fc1, ["field1"]) as cursor:
        for fc1row in cursor:
            ##Do some work/matching/etc
            pass

if __name__ == "__main__":
    fc1 = "feature1"
    fc2 = "feature2"
    the_data = [row for row in arcpy.da.SearchCursor(fc2, ["field2"])]
    # partial creates a new function where the first parameter is always fc1
    partial_function = partial(working_function, fc1)
    pool = multiprocessing.Pool()
    pool.map(partial_function, the_data)
    pool.close()
    pool.join()

EDIT: The above example doesn't really do anything on its own, but if you store the values you want to update or use later in a Python dictionary, you could use an UpdateCursor to modify the fc1 row data after multiprocessing, based on whatever criteria you specify. I've used multiprocessing on file geodatabase feature classes before with great success.
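
One way that pattern might look is sketched below. This is a rough illustration, not code from the example above: the ID and RESULT field names, and the idea that working_function returns an (ID, value) tuple, are assumptions made purely for illustration.

import arcpy
import multiprocessing
from functools import partial

def working_function(fc1, row):
    # Placeholder work: return the fc2 row's value keyed by its ID
    row_id = row[0]
    computed_value = row[1]
    with arcpy.da.SearchCursor(fc1, ["field1"]) as cursor:
        for fc1row in cursor:
            pass  # matching/aggregation would happen here
    return (row_id, computed_value)

if __name__ == "__main__":
    fc1 = "feature1"
    fc2 = "feature2"
    the_data = [row for row in arcpy.da.SearchCursor(fc2, ["ID", "field2"])]
    pool = multiprocessing.Pool()
    # Collect the worker results into a {ID: value} dictionary
    results = dict(pool.map(partial(working_function, fc1), the_data))
    pool.close()
    pool.join()

    # Apply the computed values back to fc1 with an UpdateCursor
    with arcpy.da.UpdateCursor(fc1, ["ID", "RESULT"]) as cursor:
        for row in cursor:
            if row[0] in results:
                row[1] = results[row[0]]
                cursor.updateRow(row)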

You can also copy your feature classes to the "in_memory" workspace, which stores your features in RAM and significantly speeds up the read/write time.
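
For example, a copy into in_memory before the heavy reads might look like this (a minimal sketch; "feature1" and the field names are placeholders):

import arcpy

# Copy the on-disk feature class into the in_memory workspace (RAM)
arcpy.CopyFeatures_management("feature1", "in_memory/feature1")

# Subsequent reads come from memory rather than disk
with arcpy.da.SearchCursor("in_memory/feature1", ["ID", "FIELD1"]) as cursor:
    for row in cursor:
        pass  # do the actual work here

# Delete the in-memory copy when finished to free the RAM
arcpy.Delete_management("in_memory/feature1")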

In_memory workspace: http://pro.arcgis.com/en/pro-app/help/analysis/geoprocessing/modelbuilder/the-in-memory-workspace.htm

Multiprocessing: https://docs.python.org/2/library/multiprocessing.html

EDIT: You can also combine the two methodologies to further optimize, like this:

import arcpy
import multiprocessing
from functools import partial

def working_function(data_dict, row):
    # 'row' is one record from fc2; look up its matching value from fc1
    fc2row = row
    wanted_value = data_dict[row[0]]
    return wanted_value

if __name__ == "__main__":
    fc1 = "feature1"
    fc2 = "feature2"
    data_dict = {row[0]: row[1] for row in arcpy.da.SearchCursor(fc1, ["ID", "field1"])}
    the_data = [row for row in arcpy.da.SearchCursor(fc2, ["ID", "field2"])]
    # partial creates a new function where the first parameter is always data_dict
    partial_function = partial(working_function, data_dict)
    pool = multiprocessing.Pool()
    results = pool.map(partial_function, the_data)
    pool.close()
    pool.join()