Python ArcPy – Faster Methods of Finding the Smallest Number in a Field

arcpylist

Using arcgis desktop 10.3.1 I have a script which uses a search cursor to append values to a list and then use min() to find the smallest integer. The variable is then used in a script. The Feature class has 200,000 rows and the script takes a very long time to complete. Is there a way to do this quicker? At the moment I think I would just do it by hand rather than write a script due to the length of time it takes.

import arcpy
fc = arcpy.env.workspace = arcpy.GetParameterAsText(0)
Xfield = "XKoordInt"
cursor = arcpy.SearchCursor(fc)
ListVal = []
for row in cursor:
    ListVal.append(row.getValue(Xfield))
value = min(ListVal)-20
print value
expression = "(!XKoordInt!-{0})/20".format(value)
arcpy.CalculateField_management (fc, "Matrix_Z" ,expression, "PYTHON")

Best Answer

I can see several things that may be causing your script to be slow. The thing that is likely being very slow is the arcpy.CalculateField_management() function. You should use a cursor, it will by several magnitudes faster. Also, you said you are using ArcGIS Desktop 10.3.1, but you're using the old ArcGIS 10.0 style cursors, which are also much slower.

The min() operation even on a a list of 200K will be pretty quick. You can verify this by running this small snippet; it happens in the blink of an eye:

>>> min(range(200000)) # will return 0, but is still checking a list of 200,000 values very quickly

See if this is any faster:

import arcpy
fc = arcpy.env.workspace = arcpy.GetParameterAsText(0)
Xfield = "XKoordInt"
with arcpy.da.SearchCursor(fc, [Xfield]) as rows:
    ListVal = [r[0] for r in rows]

value = min(ListVal) - 20
print value

# now update
with arcpy.da.UpdateCursor(fc, [Xfield, 'Matrix_Z']) as rows:
    for r in rows:
        if r[0] is not None:
            r[1] = (r[0] - value) / 20.0
            rows.updateRow(r)

EDIT:

I ran some timing tests and as I suspected, the field calculator took almost twice as long as the new style cursor. Interestingly, the old style cursor was ~3x slower than the field calculator. I created 200,000 random points and used the same field names.

A decorator function was used to time each function (may be some slight overhead in the setup and tear down of functions, so maybe the timeit module would be a little more accurate to test snippets).

Here are the results:

Getting the values with the old style cursor: 0:00:19.23 
Getting values with the new style cursor: 0:00:02.50 
Getting values with the new style cursor + an order by sql statement: 0:00:00.02

And the calculations: 

field calculator: 0:00:14.21 
old style update cursor: 0:00:42.47 
new style cursor: 0:00:08.71

And here is the code I used (broke everything down to individual functions to use the timeit decorator):

import arcpy
import datetime
import sys
import os

def timeit(function):
    """will time a function's execution time
    Required:
        function -- full namespace for a function
    Optional:
        args -- list of arguments for function
        kwargs -- keyword arguments for function
    """
    def wrapper(*args, **kwargs):
        st = datetime.datetime.now()
        output = function(*args, **kwargs)
        elapsed = str(datetime.datetime.now()-st)[:-4]
        if hasattr(function, 'im_class'):
            fname = '.'.join([function.im_class.__name__, function.__name__])
        else:
            fname = function.__name__
        print'"{0}" from {1} Complete - Elapsed time: {2}'.format(fname, sys.modules[function.__module__], elapsed)
        return output
    return wrapper

@timeit
def get_value_min_old_cur(fc, field):
    rows = arcpy.SearchCursor(fc)
    return min([r.getValue(field) for r in rows])

@timeit
def get_value_min_new_cur(fc, field):
    with arcpy.da.SearchCursor(fc, [field]) as rows:
        return min([r[0] for r in rows])

@timeit
def get_value_sql(fc, field):
    """good suggestion to use sql order by by dslamb :) """
    wc = "%s IS NOT NULL"%field
    sc = (None,'Order By %s'%field)
    with arcpy.da.SearchCursor(fc, [field]) as rows:
        for r in rows:
            # should give us the min on the first record
            return r[0]

@timeit
def test_field_calc(fc, field, expression):
    arcpy.management.CalculateField(fc, field, expression, 'PYTHON')

@timeit
def old_cursor_calc(fc, xfield, matrix_field, value):
    wc = "%s IS NOT NULL"%xfield
    rows = arcpy.UpdateCursor(fc, where_clause=wc)
    for row in rows:
        if row.getValue(xfield) is not None:
            
            row.setValue(matrix_field, (row.getValue(xfield) - value) / 20)
            rows.updateRow(row)

@timeit
def new_cursor_calc(fc, xfield, matrix_field, value):
    wc = "%s IS NOT NULL"%xfield
    with arcpy.da.UpdateCursor(fc, [xfield, matrix_field], where_clause=wc) as rows:
        for r in rows:
            r[1] = (r[0] - value) / 20
            rows.updateRow(r)
                                    

if __name__ == '__main__':
    Xfield = "XKoordInt"
    Mfield = 'Matrix_Z'
    fc = r'C:\Users\calebma\Documents\ArcGIS\Default.gdb\Random_Points'
    
    # first test the speed of getting the value
    print 'getting value tests...'
    value = get_value_min_old_cur(fc, Xfield)
    value = get_value_min_new_cur(fc, Xfield)
    value = get_value_sql(fc, Xfield)

    print '\n\nmin value is {}\n\n'.format(value)

    # now test field calculations
    expression = "(!XKoordInt!-{0})/20".format(value)
    test_field_calc(fc, Xfield, expression)
    old_cursor_calc(fc, Xfield, Mfield, value)
    new_cursor_calc(fc, Xfield, Mfield, value)

And finally, this is what the actual print out was from my console.

>>> 
getting value tests...
"get_value_min_old_cur" from <module '__main__' from 'C:/Users/calebma/Desktop/speed_test2.py'> Complete - Elapsed time: 0:00:19.23
"get_value_min_new_cur" from <module '__main__' from 'C:/Users/calebma/Desktop/speed_test2.py'> Complete - Elapsed time: 0:00:02.50
"get_value_sql" from <module '__main__' from 'C:/Users/calebma/Desktop/speed_test2.py'> Complete - Elapsed time: 0:00:00.02


min value is 5393879


"test_field_calc" from <module '__main__' from 'C:/Users/calebma/Desktop/speed_test2.py'> Complete - Elapsed time: 0:00:14.21
"old_cursor_calc" from <module '__main__' from 'C:/Users/calebma/Desktop/speed_test2.py'> Complete - Elapsed time: 0:00:42.47
"new_cursor_calc" from <module '__main__' from 'C:/Users/calebma/Desktop/speed_test2.py'> Complete - Elapsed time: 0:00:08.71
>>>

Edit 2: Just posted some updated tests, I found a slight flaw with my timeit function.

Related Solutions

ArcPy Insert Cursor – Not Inserting All Rows Issue

I had initially neglected to delete my irow object at the end of the script; after updating my code the insert cursor appears to be adding all rows (even the last one!) the way it should be. Here is the final code:

import arcpy, os

months = {'JAN':1,'FEB':2,'MAR':3,'APR':4,'MAY':5,'JUN':6,'JUL':7,'AUG':8,
          'SEP':9,'OCT':10,'NOV':11,'DEC':12}
ignore = ['Arbitrary_count','TOTAL_MGY','JAN','FEB','MAR','APR','MAY','JUN','JUL','AUG','SEP','OCT','NOV','DEC']

ws = r'D:\Data\Users\jbellino\Project\faswam\data\water_use\SC\FromTomAb\SC_WELLS_data_jcb.mdb'
arcpy.env.workspace = ws
arcpy.env.overwriteOutput = True
tbl = 'ORIGINAL_DHEC_WELL DATA'
itbl = 'monthly_dhec_well_data'

fields = arcpy.ListFields(tbl)

rows = arcpy.SearchCursor(os.path.join(ws,tbl))
irows = arcpy.InsertCursor(os.path.join(ws,itbl))
for row in rows:
    for month in months:
        #--for each row in the original table, and for each month stored in that row, 
        # create a new record in 'itbl'
        irow = irows.newRow()
        for field in fields:
            if field.name == month:
                #--if the field name refers to a month abbreviation it contains data
                # first convert the month abbreviation to month number
                irow.cn_mo = months[month]
                try:
                    # then grab the data in the field and process it into the appropriate 
                    # fields of the new table in the correct units
                    irow.cn_qnty_mo_va = row.getValue(field.name)*1000000
                    irow.cn_qnty_mo_va_mega = row.getValue(field.name)
                except:
                    #--skip null values
                    pass
            elif field.name not in ignore:
                #--if the field name is not a month abbreviation, just copy the data
                # to the new table
                irow.setValue(field.name,row.getValue(field.name))
        irows.insertRow(irow)
del irow,irows
del row,rows

[GIS] arcpy; Select by attribute with an If, then statement

Let's say you have a list of shapefiles and the field name for the attribute is the same in each shapefile. That means you can make the SQL query in the beginning of the script, so the first bit could look like this:

search_id = arcpy.GetParameterAsText(0)

shps = [r"path\to\shp1.shp",r"path\to\shp2.shp"]
output_shp = r"path\to\output.shp"
field = "ID_FIELD_NAME"

sql = '"{0}" = \'{1}\''.format(field,search_id)

Here are two ways to go about your selection process. The first uses a where clause in the MakeFeatureLayer() and GetCount(), and the second uses a SearchCursor() to find the attribute and proceeds to MakeFeatureLayer(). I'm not sure which one is faster, but my guess is the second. Most likely, the speed is very similar if you don't have tons of shapefiles.

for shp in shps:

    # make feature layer
    if arcpy.Exists("fl"):
        arcpy.management.Delete("fl")
    fl = arcpy.management.MakeFeatureLayer(shp,"fl",sql)

    # check count of features in new feature layer, skip to next shp if it == 0
    if int(arcpy.management.GetCount(fl).getOutput(0)) == 0:
        continue

    # copy features, only happens if there is a feature in the feature layer
    arcpy.management.CopyFeatures(fl,output_shp)

and here's the second. In this one, a SearchCursor is used to check for the presence of the feature, and then the MakeFeatureLayer() is used. I think this would probably be the faster way:

for shp in shps:

    # use list comprehension with a SearchCursor to check for the id
    all_ids = [r[0] for r in arcpy.da.SearchCursor(shp,field)]
    if not search_id in all_ids:
        continue

    # now make feature layer and copy features
    if arcpy.Exists("fl"):
        arcpy.management.Delete("fl")
    fl = arcpy.management.MakeFeatureLayer(shp,"fl",sql)

    # copy features
    arcpy.management.CopyFeatures(fl,output_shp)

EDIT:

With differing field names, you can store them and iterate the shapefiles like this:

shps = [(r"path\to\shp1.shp","ID_FIELD_NAME1"),
        (r"path\to\shp2.shp","ID_FIELD_NAME2")]
output_shp = r"path\to\output.shp"

for item in shps:
    shp = item[0]
    field = item[1]
    sql = '"{0}" = \'{1}\''.format(field,search_id)

    #continue with the rest of the code as above

Best Answer

Related Solutions

ArcPy Insert Cursor – Not Inserting All Rows Issue

[GIS] arcpy; Select by attribute with an If, then statement

Related Question