[GIS] Efficiently selecting related records using ArcPy

arcpy · enterprise-geodatabase · optimization

Below is the code I'm using to replicate the "related tables" button in ArcMap. In ArcMap that button selects features in one feature class or table based on the selection of features in another related feature class or table.

In ArcMap I can use that button to "push" my selection to the related table in a matter of seconds. I was unable to find anything built into arcpy that replicates the button, so I used some nested loops to do the same task.

The code below loops through a table of "treatments". For each treatment, it loops through a list of "trees". When a match is found between the ID fields of the treatment and the tree, that tree is added to the selection in the tree layer. Once a match is found for a treatment, the code stops searching the tree layer for additional matches, goes back to the treatment table, selects the next treatment, and searches the tree feature class again.

The code itself works fine, but it is agonizingly slow. The "treatment table" in this case has 16,000 records. The "tree" feature class has 60,000 records.

Is there another more efficient way to recreate what ESRI is doing when it pushes the selection from one table to another? Should I be creating an index for the tables? NOTE: This data is stored in an SDE.

# Create search cursor to loop through the treatments
treatments = arcpy.SearchCursor(treatment_tv)
treatment_field = "Facility_ID"
tree_field = "FACILITYID"

for treatment in treatments:

    # Get ID of the current treatment
    treatment_ID = treatment.getValue(treatment_field)

    # Create a new search cursor over the trees for every treatment
    trees = arcpy.SearchCursor(tree_fl)

    for tree in trees:

        # Get ID of the current tree
        tree_ID = tree.getValue(tree_field)

        if tree_ID == treatment_ID:
            query = "FACILITYID = " + str(tree_ID)
            arcpy.SelectLayerByAttribute_management(tree_fl, "ADD_TO_SELECTION", query)
            break

Best Answer

First off, yes you will definitely want to make sure your primary and foreign key fields are indexed on both tables. This lets the DBMS plan and execute queries against these fields much more efficiently.

Secondly, you are calling SelectLayerByAttribute_management in a tight, nested loop (once per tree per treatment). This is highly inefficient, for several reasons:

  • You don't need two loops, one nested within the other, for this, as far as I can tell. One will suffice.
  • Geoprocessing functions are "chunky" and take a lot of time to call compared to typical built-in Python functions. You should avoid calling them in a tight loop.
  • Asking for one record/ID at a time results in vastly more round trips to the database.
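To illustrate the first two points, a membership test against a set replaces the inner loop entirely. This is a plain-Python sketch with invented IDs, not arcpy:

```python
# Invented IDs for illustration only.
treatment_ids = ["T1", "T2", "T3"]   # one pass over treatments
tree_ids = ["T2", "T9", "T1", "T2"]  # one pass over trees

# Nested-loop approach: O(len(treatments) * len(trees)) comparisons.
matches_slow = [t for t in tree_ids for s in treatment_ids if t == s]

# Set-based approach: O(len(treatments) + len(trees)).
treatment_set = set(treatment_ids)
matches_fast = [t for t in tree_ids if t in treatment_set]

assert matches_slow == matches_fast  # same result, far fewer comparisons
```

With 16,000 treatments and 60,000 trees, that is the difference between roughly a billion comparisons and under a hundred thousand.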

Instead, refactor your code so that you call SelectLayerByAttribute_management just once, with a where clause constructed to select all of the related records.

Borrowing a function from another answer for the where-clause construction logic, I'd imagine it would look something like this:

def selectRelatedRecords(sourceLayer, targetLayer, sourceField, targetField):
    sourceIDs = set([row[0] for row in arcpy.da.SearchCursor(sourceLayer, sourceField)])
    whereClause = buildWhereClauseFromList(targetLayer, targetField, sourceIDs)
    arcpy.AddMessage("Selecting related records using WhereClause: {0}".format(whereClause))
    arcpy.SelectLayerByAttribute_management(targetLayer, "NEW_SELECTION", whereClause)

You could call it like so: selectRelatedRecords(treatment_tv, tree_fl, "Facility_ID", "FACILITYID")
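For reference, here is a simplified sketch of what a helper like buildWhereClauseFromList does. The real version in the linked answer takes the layer as well and uses arcpy.AddFieldDelimiters and the field's type to decide on quoting; this stripped-down stand-in just quotes anything that isn't numeric:

```python
def build_where_clause_from_list(field_name, values):
    """Simplified stand-in for the linked answer's helper.

    The real helper inspects the field type via arcpy; here, string
    values are naively single-quoted for illustration.
    """
    if all(isinstance(v, (int, float)) for v in values):
        value_list = ", ".join(str(v) for v in values)
    else:
        value_list = ", ".join("'{0}'".format(v) for v in values)
    return "{0} IN ({1})".format(field_name, value_list)

print(build_where_clause_from_list("FACILITYID", [101, 102, 103]))
# FACILITYID IN (101, 102, 103)
```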

Notes:

  • This uses an arcpy.da.SearchCursor, only available at ArcGIS 10.1 and later. As @PolyGeo mentioned, these cursors are much faster than their predecessors (arcpy.SearchCursor). It could be easily modified to use the old SearchCursor though:

    sourceIDs = set([row.getValue(sourceField) for row in arcpy.SearchCursor(sourceLayer, "", "", sourceField)])
    
  • If your SDE geodatabase is on Oracle, be warned that the IN statement used in the function from the linked answer is limited to 1000 elements. One possible solution is described in this answer, but you'd have to modify the function to split the list across multiple IN statements of at most 1000 elements each, combined with OR, instead of building one.
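The splitting itself is plain Python. A sketch of that workaround, with the chunk size reduced to 3 so the output is readable (in practice you'd use 1000, and quote string values as discussed above):

```python
def build_chunked_in_clause(field_name, values, chunk_size=1000):
    """Combine several IN (...) clauses with OR so that no single
    IN list exceeds chunk_size elements (Oracle's limit is 1000).
    Assumes numeric values; string values would need quoting."""
    values = list(values)
    clauses = []
    for i in range(0, len(values), chunk_size):
        chunk = values[i:i + chunk_size]
        clauses.append("{0} IN ({1})".format(
            field_name, ", ".join(str(v) for v in chunk)))
    return " OR ".join(clauses)

print(build_chunked_in_clause("FACILITYID", [1, 2, 3, 4, 5], chunk_size=3))
# FACILITYID IN (1, 2, 3) OR FACILITYID IN (4, 5)
```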