[GIS] Multiprocessing issues with ArcPy

arcgis-10.1 · arcpy · parallel-processing · spatial-analyst

I am using the ArcGIS 'Watershed' routine in a script that makes use of multiprocessing.
This works fine, but during execution I get the message:

Unable to remove directory.  Possible causes:
1- Not owner of the directory
2- Another person or application is accessing this directory

EDIT 27/02/14:
I've incorporated the suggestions of using arcpy.Exists (with a very rudimentary way of checking whether something didn't exist) and of writing all results to disk:

import os
import tempfile

import arcpy
from arcpy import sa

arcpy.CheckOutExtension("Spatial")

def watershed(pnts, flowdir, flowacc):
    direc = tempfile.mkdtemp(dir="C:\\temp")  # Create a separate directory for this process's files to be written to
    arcpy.env.scratchWorkspace = direc
    arcpy.env.workspace = direc

    res = []
    for i, p in enumerate(pnts):
        pnt = arcpy.PointGeometry(arcpy.Point(p.x, p.y, ID=i))  # Convert the Shapely point to an arcpy point
        pourpt = sa.SnapPourPoint(pnt, flowacc, 10000)  # Snap point to a high flow accumulation cell
        if arcpy.Exists(pourpt):
            ws = sa.Watershed(flowdir, pourpt)  # Calculate watershed
            if arcpy.Exists(ws):
                out = os.path.join(direc, "poly_%i" % i)
                poly = arcpy.RasterToPolygon_conversion(ws, out)  # Convert to polygon
                res.append(poly[0])  # Put the polygon in the results list for this set of points
            else:
                res.append("NoWS")
        else:
            res.append("NoPourPt")

    return res

The parallel function remains the same:

from multiprocessing import Pool

def watershed_pll(mod, proc=6):
    """
    Calculate the watershed for each station point using parallel processing.
    All relevant data is held in the mod object.
    """
    pool = Pool(processes=proc)

    # Iterate over each feature of the drainage path and submit its list of
    # pour points to the worker function
    jobs = []
    for key, val in mod.stations.geometry.iteritems():
        jobs.append((key, pool.apply_async(watershed,
                                           (val.points, mod.flowdir, mod.flowacc))))

    pool.close()
    pool.join()
    return jobs
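One thing worth noting with this pattern: nothing ever calls .get() on the AsyncResult objects, so any exception raised inside a worker is silently swallowed, which can make intermittent failures look random. Below is a minimal stdlib-only sketch of the same submit/collect pattern with the exceptions surfaced; the worker `_work` and the sample data are stand-ins for the watershed code, not the real thing:

```python
from multiprocessing import Pool

def _work(points):
    # Stand-in for the arcpy watershed worker: any per-station computation.
    if not points:
        raise ValueError("empty station")  # will surface when .get() is called
    return [p * 2 for p in points]

def run_parallel(stations, proc=2):
    pool = Pool(processes=proc)
    jobs = [(key, pool.apply_async(_work, (pts,))) for key, pts in stations.items()]
    pool.close()
    pool.join()

    results = {}
    for key, job in jobs:
        try:
            # .get() re-raises any exception the worker hit, together with its
            # traceback; without it, worker errors vanish silently.
            results[key] = job.get()
        except Exception as exc:
            results[key] = "FAILED: %s" % exc
    return results

if __name__ == "__main__":
    out = run_parallel({"a": [1, 2], "b": []})
    print(out["a"])  # [2, 4]
    print(out["b"])  # FAILED: empty station
```

Collecting with .get() like this at least tells you which station failed and why, instead of leaving a half-written workspace behind.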

Sometimes it runs perfectly; other times I get one of various errors, such as the initial one above, or:

FATAL ERROR (INFADI)
MISSING DIRECTORY

and:

ERROR 010088: Invalid input geodataset (Layer, Tin, etc.).

and:

 ERROR 010050: Cell size is not set.

This is odd, as I explicitly set the cell size within the function…

Whether it works or not seems to be hit-and-miss. Such errors lead me to believe that arcpy is still trying to delete things behind the scenes. Is there any way to explicitly prevent it from doing so?


I wonder if my problem could be related to one of two things:

  1. Environments/folders getting mixed up. The input data is passed as complete file paths to raster datasets that sit outside the scratchWorkspace (which is set locally for each process) where intermediate/output data is created. However, I have noticed that Arc may create folders (typically an 'info' folder) in the directories of the input data. Why is that? Can it be prevented, or can I stop Arc from then trying to delete it?

  2. Are there any potential problems with accessing the input raster datasets at the same time? I.e. each process will be attempting to open and read from the flow direction and flow accumulation rasters that are passed into the function.

Best Answer

Some general steps for debugging this kind of problem:

  • Set the current workspace and the scratch workspace to something random and specific to the process.
  • Write to actual output files instead of relying on in-memory variables. You will be able to see what has been created, and you avoid automatically generated file names.
  • Some Spatial Analyst functions (e.g. RasterToPolygon) need very short file names to work properly.
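The first and last bullets can be sketched like this. `setup_workspace` and `grid_name` are hypothetical helpers (not from the question's script), and the 13-character limit is the one that applies to ESRI GRID raster names:

```python
import os
import tempfile

def setup_workspace(base=None):
    # One unique directory per process, tagged with the pid, so concurrent
    # workers can never collide on the same workspace. The question used
    # dir="C:\\temp" as the base; we default to the system temp dir here.
    # In the real script you would then point both arcpy environments at it:
    #   arcpy.env.workspace = direc
    #   arcpy.env.scratchWorkspace = direc
    base = base or tempfile.gettempdir()
    return tempfile.mkdtemp(prefix="p%d_" % os.getpid(), dir=base)

def grid_name(prefix, i):
    # ESRI GRID raster names must be short (13 characters or fewer), so
    # build names like "ws0", "ws1", ... and fail loudly if one is too long.
    name = "%s%d" % (prefix, i)
    if len(name) > 13:
        raise ValueError("GRID name too long: %r" % name)
    return name
```

Combining the two, each worker would call `setup_workspace()` once on entry and then build every intermediate output with `grid_name`, which removes both the shared-directory collisions and the long-name failures from the picture.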