[GIS] Limit to number of items using ‘IN’ SQL syntax in arcmap/arcpy

arcgis-10.3arcpysql

I'm writing a python script that uses 'NOT IN ('a', 'b', ...)' statements on very large lists (like comparing 500 items in one feature class against >10k items in another) and am getting weird unexpected results (sometimes it works, sometimes it doesn't and it corrupts the geodatabase…?!?).

I've looked and can't find any ArcMap/arcpy documentation on how large a list can be using IN/NOT IN syntax and was wondering if anyone knew. Using ArcGIS 10.3.1 with 64-bit python.

Update: All of the data in question is stored in file geodatabases.

Best Answer

The limit imposed by the Select By Attributes window in ArcMap is 26114 symbols. I have just generated a list of integers and pasted in the dialog box:

range_length = 4000
selection_tuple = str(tuple([x for x in range(range_length)]))

Now just paste into the dialog box and see how it is truncated (... replaces the sequence, just to keep it short for posting here):

OBJECTID in (0,1,2,3,4...3884,3885,3886,3)

Regarding Python, I am able to execute this code both in Python 32 and 64 bit. Tests done on ArcGIS 10.3.1. Doing selection both with Make Feature Layer and Select Layer By Attributes having in the selection tuple 100K+ items is no problem at all.

import arcpy
import random

range_length = 210000
values = [x for x in range(range_length)]
random.shuffle(values) #get randomly sorted values
selection_tuple = str(tuple(values))

print('selection length feats',range_length)
fc = r'C:\GIS\temp.gdb\Streets' #400K+ lines

arcpy.env.workspace = r"C:\GIS\Temp\ArcGISHomeFolder\Default.gdb"
where_clause = '''ReqField IN {0}'''.format(selection_tuple)

#either works fine
#fl = arcpy.MakeFeatureLayer_management(fc,where_clause=where_clause)
fl = arcpy.MakeFeatureLayer_management(fc) #takes 3-5 secs
arcpy.SelectLayerByAttribute_management(fl,"NEW_SELECTION",where_clause)

l = fl.getOutput(0)
print('feats count:', int(arcpy.GetCount_management(l).getOutput(0)))
arcpy.CopyFeatures_management(l,"range_{}".format(range_length))

One thing to note is that having an index calculated on your field will speed up significantly the selection (this is what I've found from my tests). So, check whether you have the index calculated on the field you are doing selection.

In terms of performance, the Make Feature Layer with the IN statement in the where clause takes the following time to execute (the fc with 2.5mln rows is used with randomly generated numbers in range (1,10K):

range_length = 2000000 #13sec on Python 32bit
range_length = 1000000 #7sec
range_length = 500000 #3sec
range_length = 2500000 #11sec on Python 64bit
Related Question