[GIS] Improving Python script processing speed (performance)

arcgis-9.3, arcpy, performance

I've been working on a script that reads in a large amount of point and polygon data and generates frequency values for a feature class of 2.5-acre hexes covering California. I found it necessary to write my own version of the Frequency tool, because ESRI's default tool has been crashing ArcMap on an irregular basis, and this script needs to be reliable and predictable. To do this I've been using Python dictionaries to store values read in from the input datasets. The dictionaries are two-dimensional (each key contains another dictionary) and get quite large: there are 60,000+ primary keys with 1-28 secondary keys each. Building the dictionaries isn't the problem, though; they get written reasonably quickly.

My problem is that once the dictionaries are built, running my makeshift frequency tool takes a long time. Here's the code and a short example of the dictionary entries; I'll explain a bit more below:

There are two dictionaries I read from; here's an example of 'otherSourceDict'. hexDict is handled a little differently (its secondary keys are two-letter element codes such as AA, AB, etc… instead of taxon names like 'Amphibian'), but it stores essentially the same information:

{13456: {'Amphibian': 2, 'Bird': 5, 'Mammal': 10, 'Plant': 20}, 43156: {'Fish': 1, 'Plant': 4}}

The primary keys (13456 and 43156) are hex feature unique IDs. The secondary keys (Amphibian, Mammal…) hold the number of observations of that taxon in that hex.
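As a concrete illustration (reusing the example IDs above), per-hex totals can be pulled out of this kind of nested dictionary with ordinary dict operations:

```python
# Example nested dictionary: hex ID -> {taxon name: observation count}
otherSourceDict = {13456: {'Amphibian': 2, 'Bird': 5, 'Mammal': 10, 'Plant': 20},
                   43156: {'Fish': 1, 'Plant': 4}}

# Total observations in hex 13456, summed across all taxa
total = sum(otherSourceDict[13456].values())  # 37

# Count for a single taxon, defaulting to 0 when the taxon is absent
fish_count = otherSourceDict[43156].get('Fish', 0)  # 1
```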

My frequency code:

#Build a list of important correlated values for reference later
taxaList = [['RAR2_AMPH', 'AA', 'Amphibian', 'R2_NRM_A'],
            ['RAR2_BIRD', 'AB', 'Bird', 'R2_NRM_B'],
            ['RAR2_FISH', 'AF', 'Fish', 'R2_NRM_F'],
            ['RAR2_MAMM', 'AM', 'Mammal', 'R2_NRM_M'],
            ['RAR2_REPT', 'AR', 'Reptile', 'R2_NRM_R'],
            ['RAR2_PLNT', 'P', 'Plant', 'R2_NRM_P']]
#Add fields for taxa value by hex
sendmsg("     Adding fields...")
fieldList = gp.listfields(inRare)
for taxa in range(len(taxaList)):
    if not taxaList[taxa][0] in fieldList:
        gp.addfield_management(inRare, taxaList[taxa][0], "SHORT")
cur = gp.updatecursor(inRare)
row = cur.next()
sendmsg("     Writing dictionaries to hexs")
c=0
while row:
    #Count for user visualization of process completion
    c+=1
    if c%10000 == 0:
        sendmsg("          Starting record number " + str(c) + "...")
    #For each taxa, calculate the frequency and populate the corresponding field.
    for taxa in range(len(taxaList)):
        val=0
        if str(int(row.HEX25_ID)) in hexDict.keys():
            #Search the CNDDB dictionary and count how many of each taxa per hex
            for elmcode in hexDict[str(int(row.HEX25_ID))].keys():
                #If dealing with taxa other than plants
                if elmcode != "ECO_SECT" and taxaList[taxa][1] != "P":
                    if elmcode[:2] == taxaList[taxa][1]:
                        val+=1
                #If dealing with plants
                elif elmcode != "ECO_SECT" and taxaList[taxa][1] == "P":
                    if elmcode[0] == "P" or elmcode[0] == "N":
                        val+=1
        if str(int(row.HEX25_ID)) in otherSourceDict.keys():
            #Search the other source dictionary and count how many of each taxa per hex
            for otherTaxa in otherSourceDict[str(int(row.HEX25_ID))].keys():
                if otherTaxa == taxaList[taxa][2]:
                    val+=int(otherSourceDict[str(int(row.HEX25_ID))][otherTaxa])
        #Populate the total number of each taxa per hex to the rare species hex features
        row.setValue(taxaList[taxa][0], val)
    cur.updaterow(row)
    row = cur.next()

del row, cur

So right now, when it goes to populate the hex features, it loops over the taxa list six times per row (once for each output field), runs the frequency calculation for that taxon, and populates the field. It does this for roughly 60,000 rows.

Does anyone have any suggestions or alternative methods for populating the hex features that could speed up the process? The script performs several operations; this one takes by far the most time (about 25 minutes), and I would like it to be as quick as possible.

Best Answer

I think there may be unavoidable cursor transaction overhead to slow you down unless there is a way to update a large batch of rows at once. Comment out "cur.updaterow(row)" and run it again... is there a difference?

The secondary slowdown in your case is a lot of unnecessary copying. In Python 2, dict.keys() builds a new list of the keys every time it's called, and you call it often. It's better to write "if k in d" and "for k in d", which test and iterate against the dictionary's keys directly without making a copy.
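To illustrate (the d below is just a toy dictionary): in Python 2, which ArcGIS 9.3 ships with, d.keys() builds a brand-new list on each call, while testing or iterating against the dict itself uses its hash table directly:

```python
d = {'AA123': 1, 'AB456': 2}

# Builds a throwaway list of the keys, then scans it linearly
found_slow = 'AA123' in d.keys()

# Probes the dict's hash table directly: no copy, O(1) on average
found_fast = 'AA123' in d

# Likewise, iterating the dict itself visits the keys without copying
codes = sorted(k for k in d)   # ['AA123', 'AB456']
```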

Can you also avoid all the str(int(foo)) expressions? Converting row.HEX25_ID to a string once per row, instead of repeating the conversion at every lookup, can save you many (millions?) of calls.
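Putting both suggestions together, the per-row work might be factored into a helper along these lines. This is only a sketch: taxa_counts is a hypothetical name, the sample data below is made up for illustration, and I've assumed (as your lookup code implies) that the dictionary keys are strings:

```python
def taxa_counts(hex_id, hexDict, otherSourceDict, taxaList):
    """Return {field_name: frequency} for one hex row.

    The hex ID is converted with str(int(...)) once per row, and the
    dictionaries are probed and iterated directly (no .keys() copies).
    """
    key = str(int(hex_id))                 # one conversion per row
    elmcodes = hexDict.get(key, {})        # {} when the hex has no entries
    other = otherSourceDict.get(key, {})
    counts = {}
    for field, code, taxon, _norm in taxaList:
        val = 0
        for elmcode in elmcodes:           # iterate the dict itself
            if elmcode == "ECO_SECT":
                continue
            if code != "P":
                if elmcode[:2] == code:
                    val += 1
            elif elmcode[0] in ("P", "N"): # plants: P... or N... codes
                val += 1
        val += int(other.get(taxon, 0))    # direct lookup, no inner loop
        counts[field] = val
    return counts

# Made-up sample data, shaped like the dictionaries in the question
taxaList = [['RAR2_AMPH', 'AA', 'Amphibian', 'R2_NRM_A'],
            ['RAR2_PLNT', 'P', 'Plant', 'R2_NRM_P']]
hexDict = {'13456': {'AA001': 1, 'PD111': 1, 'NT222': 1, 'ECO_SECT': 0}}
otherSourceDict = {'13456': {'Amphibian': 2, 'Plant': 4}}
counts = taxa_counts(13456, hexDict, otherSourceDict, taxaList)
```

The body of your while loop then reduces to one taxa_counts call plus a row.setValue per field, with no str(int(...)) or .keys() inside the taxa loop.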
