[GIS] quicker way to spatial join two feature classes

arcgis-10.1arcgis-desktoparcpy

I have two large feature classes (3.5 mil point and 3.8 mil poly) that I would like spatially join.

Is there a faster way to do this instead of just using the classic spatial join tool?

I'm guessing this is a repeat question just can't find it.

When joining feature classes of this size I usually just let it run for a few hours to complete.

In the future, I would like to find a way to turn these types of processes from hours to minutes.

I only pull over one unique ID field to create a foreign key and then relate the two tables.

Just looking for advice or ideas.

One to One Join
Intersect

Arcpy friendly
Arcgis 10.1
Processor: Intel(R) Xeon(R) CPU E5-1607 0 @ 3.00 GHz
RAM: 16 GB
System type: 64-bit Operating System

Best Answer

Possible solution using rasters:

  • Split total extent using fishnet
  • Set environment cell size
  • a) start with first tile, i.e. n = 1
  • b) Convert polygons and points to raster
  • Calculate zonal statistics (points grid = zones, polygons grid = values)
  • Count number of records in statistics table. If less than count of points, reduce cell size and go to b), otherwise transfer data from statistics table to points table, proceed with next tile, i.e. n+=1 go to a)

I tested the script below using 2.4 million points and the same number of polygons. I used fishnet equal 5*5 tiles. It means that about 100 thousands points were processed at a time. This is a snapshot of result window:

Result window

I tried spatial join on the same datasets and when it get to 5%, I cancelled it. By this time tool was running for 8 minutes. This means that with truly massive datasets (few millions), raster approach could greatly reduce processing time.

Note that I used raster to create my points and polygons, i.e. tested polygons were the most basic squares. I expect the advantage of raster approach can be even greater with more complex shapes. I could of reduce processing time further, by using fishnet with less tiles, but didn't want to risk in_memory space corruption.

SCRIPT:

import arcpy, time, os
from arcpy import env
from arcpy.sa import ZonalStatisticsAsTable as ZS
env.overwriteoutput=True
t0=time.time()
## PARAMETERS
workGDB=r'..\Scratch.gdb'
fish_net=r'D:\Scratch\fish_net.shp'
points=r'..\Scratch.gdb\points'
polygons=r'..\Scratch.gdb\polygons'
pointsGrid=r'in_memory\pntsGR'
polygonsGrid=r'in_memory\pgonsGR'
stats=r'D:\Scratch\stats.dbf'
##allStats=workGDB+os.sep+'allStats'

G=arcpy.Geometry()
tiles=arcpy.CopyFeatures_management(fish_net, G)
nTot=0
cSize=4
aDict={}
for n,tile in enumerate(tiles):
    arcpy.AddMessage("Processing tile no %i out of %i" %(n+1,len(tiles)))
    env.extent=tile.extent
    result=arcpy.GetCount_management(points)
    nPoints=int(float(result.getOutput(0)))
    while True:
        env.cellSize = cSize
        arcpy.arcpy.PointToRaster_conversion(points, "POINTID",pointsGrid)
        arcpy.FeatureToRaster_conversion(polygons, "ID",polygonsGrid)
        outZSaT = ZS(pointsGrid, "Value",polygonsGrid,stats,"DATA", "MINIMUM")
        result=arcpy.GetCount_management(stats)
        nStats=int(float(result.getOutput(0)))
        if nStats>=nPoints:break
        cSize=float(cSize)/2
    arcpy.AddMessage("Updating lookup ....")
    arcpy.SetProgressor("step", "", 0, nPoints)
    with arcpy.da.SearchCursor(stats,("Value","MIN")) as cursor:
        for v,m in cursor:
            aDict[v]=m
            arcpy.SetProgressorPosition()
    nTot+=nPoints
    arcpy.AddMessage("Processed %i points" %nTot)

arcpy.Delete_management("in_memory")
arcpy.AddMessage("Transferring results ...")
arcpy.SetProgressor("step", "", 0, nTot)
with arcpy.da.UpdateCursor(points,("pointid","PGONID")) as cursor:
    for row in cursor:
        row[1]=aDict[row[0]]
        cursor.updateRow(row)
        arcpy.SetProgressorPosition()
arcpy.AddMessage("Total time %i sec" %int(time.time()-t0))