[GIS] Efficiently selecting related records using ArcPy

arcpy · enterprise-geodatabase · optimization

Below is the code I'm using to replicate the "related tables" button in ArcMap. In ArcMap that button selects features in one feature class or table based on the selection of features in another related feature class or table.

In ArcMap I can use that button to "push" my selection to the related table in a matter of seconds. I was unable to find anything built into arcpy that replicates the button, so I used some nested loops to do the same task.

The code below loops through a table of "treatments". For each treatment, it loops through a list of "trees". When a match is found between the ID fields of the treatment and the tree, that tree is added to the selection in the tree layer. Once a match is found for a treatment, the code stops searching the tree layer for additional matches, goes back to the treatment table, selects the next treatment, and searches the tree feature class again.

The code itself works fine, but it is agonizingly slow. The "treatment table" in this case has 16,000 records. The "tree" feature class has 60,000 records.

Is there another more efficient way to recreate what ESRI is doing when it pushes the selection from one table to another? Should I be creating an index for the tables? NOTE: This data is stored in an SDE.

# Create search cursor to loop through the treatments
treatments = arcpy.SearchCursor(treatment_tv)
treatment_field = "Facility_ID"
tree_field = "FACILITYID"

for treatment in treatments:

    # Get ID of the current treatment
    treatment_ID = treatment.getValue(treatment_field)

    # Create a new search cursor over the trees for every treatment
    trees = arcpy.SearchCursor(tree_fl)

    for tree in trees:

        # Get ID of the current tree
        tree_ID = tree.getValue(tree_field)

        if tree_ID == treatment_ID:
            query = "FACILITYID = " + str(tree_ID)
            arcpy.SelectLayerByAttribute_management(tree_fl, "ADD_TO_SELECTION", query)
            break

Best Answer

First off, yes you will definitely want to make sure your primary and foreign key fields are indexed on both tables. This lets the DBMS plan and execute queries against these fields much more efficiently.

Secondly, you are calling SelectLayerByAttribute_management in a tight, nested loop (once per tree per treatment). This is highly inefficient, for several reasons:

  • You don't need two loops, one nested within the other, for this, as far as I can tell. One will suffice.
  • Geoprocessing functions are "chunky" and take a lot of time to call compared to typical built-in Python functions. You should avoid calling them in a tight loop.
  • Asking for one record/ID at a time results in vastly more round trips to the database.
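To illustrate the first two points, a membership test against a set replaces the inner loop entirely. This is a plain-Python sketch with invented IDs, not arcpy:

```python
# Invented IDs for illustration only.
treatment_ids = ["T1", "T2", "T3"]   # one pass over treatments
tree_ids = ["T2", "T9", "T1", "T2"]  # one pass over trees

# Nested-loop approach: O(len(treatments) * len(trees)) comparisons.
matches_slow = [t for t in tree_ids for s in treatment_ids if t == s]

# Set-based approach: O(len(treatments) + len(trees)).
treatment_set = set(treatment_ids)
matches_fast = [t for t in tree_ids if t in treatment_set]

assert matches_slow == matches_fast  # same result, far fewer comparisons
```

With 16,000 treatments and 60,000 trees, that is the difference between roughly a billion comparisons and under a hundred thousand.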

Instead, refactor your code so that you call SelectLayerByAttribute_management just once, with a where clause constructed to select all of the related records.

Borrowing a function from another answer for the where-clause construction logic, I'd imagine it would look something like this:

def selectRelatedRecords(sourceLayer, targetLayer, sourceField, targetField):
    sourceIDs = set([row[0] for row in arcpy.da.SearchCursor(sourceLayer, sourceField)])
    whereClause = buildWhereClauseFromList(targetLayer, targetField, sourceIDs)
    arcpy.AddMessage("Selecting related records using WhereClause: {0}".format(whereClause))
    arcpy.SelectLayerByAttribute_management(targetLayer, "NEW_SELECTION", whereClause)

You could call it like so: selectRelatedRecords(treatment_tv, tree_fl, "Facility_ID", "FACILITYID")
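For reference, here is a simplified sketch of what a helper like buildWhereClauseFromList does. The real version in the linked answer takes the layer as well and uses arcpy.AddFieldDelimiters and the field's type to decide on quoting; this stripped-down stand-in just quotes anything that isn't numeric:

```python
def build_where_clause_from_list(field_name, values):
    """Simplified stand-in for the linked answer's helper.

    The real helper inspects the field type via arcpy; here, string
    values are naively single-quoted for illustration.
    """
    if all(isinstance(v, (int, float)) for v in values):
        value_list = ", ".join(str(v) for v in values)
    else:
        value_list = ", ".join("'{0}'".format(v) for v in values)
    return "{0} IN ({1})".format(field_name, value_list)

print(build_where_clause_from_list("FACILITYID", [101, 102, 103]))
# FACILITYID IN (101, 102, 103)
```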

Notes:

  • This uses an arcpy.da.SearchCursor, only available at ArcGIS 10.1 and later. As @PolyGeo mentioned, these cursors are much faster than their predecessors (arcpy.SearchCursor). It could be easily modified to use the old SearchCursor though:

    sourceIDs = set([row.getValue(sourceField) for row in arcpy.SearchCursor(sourceLayer, "", "", sourceField)])
    
  • If your SDE geodatabase is on Oracle, be warned that the IN statement used in the function from the linked answer is limited to 1000 elements. One possible solution is described in this answer, but you'd have to modify the function to split the list across multiple IN statements of at most 1000 elements each, combined with OR, instead of building one.
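The splitting itself is plain Python. A sketch of that workaround, with the chunk size reduced to 3 so the output is readable (in practice you'd use 1000, and quote string values as discussed above):

```python
def build_chunked_in_clause(field_name, values, chunk_size=1000):
    """Combine several IN (...) clauses with OR so that no single
    IN list exceeds chunk_size elements (Oracle's limit is 1000).
    Assumes numeric values; string values would need quoting."""
    values = list(values)
    clauses = []
    for i in range(0, len(values), chunk_size):
        chunk = values[i:i + chunk_size]
        clauses.append("{0} IN ({1})".format(
            field_name, ", ".join(str(v) for v in chunk)))
    return " OR ".join(clauses)

print(build_chunked_in_clause("FACILITYID", [1, 2, 3, 4, 5], chunk_size=3))
# FACILITYID IN (1, 2, 3) OR FACILITYID IN (4, 5)
```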