I am currently writing a Python script using the arcgisscripting module to process a reasonably large data set (~10,000 records in total) normalised over a small number of tables, 8 in total. The process consists of creating a feature based on coordinate tuples (x, y) and creating a graph (nodes and lines) using the relationships in the other 7 tables for guidance. The final output is a personal geodatabase (pgdb/fgdb) with node and edge spatial data sets that visually represent the relationships.
My initial attempt was to use queries of the new geodatabase tables and SearchCursor record sets to populate link tables (InsertCursor) for the many-to-many relationships that occur. This worked very well, except for the 15-20 min processing time.
Using the cProfile module in Python, it was apparent that 'thrashing' a personal geodatabase with requests for cursors (Search and Insert cursors) while performing the search queries to populate the link tables caused the appalling performance.
With a little refactoring I have managed to get the processing time below 2.5 minutes. The trade-off was partial construction of the geodatabase schema in code and limiting requests for arcgisscripting cursors to InsertCursors once all the relationships were collated.
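Since arcgisscripting is only available inside ArcGIS, the collate-then-insert pattern described above can be sketched with plain dictionaries standing in for the cursor record sets (the field names `parent_id` and `child_id` are illustrative, not from the original schema):

```python
from collections import defaultdict

# Hypothetical rows standing in for SearchCursor results on a related table.
child_rows = [
    {"parent_id": 1, "child_id": 10},
    {"parent_id": 1, "child_id": 11},
    {"parent_id": 2, "child_id": 12},
]

# Collate the many-to-many relationships in memory first...
links = defaultdict(list)
for row in child_rows:
    links[row["parent_id"]].append(row["child_id"])

# ...then flatten into the rows for a single InsertCursor pass, instead of
# opening a fresh Search/Insert cursor pair for every parent record.
link_table = [(pid, cid) for pid, cids in sorted(links.items()) for cid in cids]
```

The win comes from paying the cursor-creation overhead once per table rather than once per record.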
My question is one of performance:

- What techniques have people used to maintain reasonable compute times when working with large data sets?
- Are there any ESRI-recommended methods that I've missed in my search for optimisation?
I understand the overhead incurred when creating an arcgisscripting cursor, particularly if it is from a personal geodatabase, though after a lengthy search for performance-related answers on this site and Google, I am under the impression that performance isn't at the forefront of people's endeavours.
- As a user of ESRI products, does one expect and condone these performance lags?
UPDATE
After some work with this product I have accumulated a list of optimisation techniques for a process that converts spatial information from a proprietary format to a geodatabase. These have been developed for both personal and file geodatabases.
Tidbits:
Read your data and rationalise it in memory. This will cut your time in half.
Create feature classes and tables in memory. Use the feature dataset keyword 'in_memory' to use your memory as a RAM disk, perform your functions there, and then write out to disk.
To write out to disk, use CopyFeatureclass for feature classes and CopyRow for tables.
These 3 things took a script that converted 100,000+ features to a geodatabase from 30 minutes down to 30–40 seconds, including relationship classes. They are not to be used lightly: most of the methods above use a lot of memory, which could cause you issues if you are not paying attention.
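As a rough stand-in for that workflow outside ArcGIS, the same build-in-RAM/write-once shape can be sketched with sqlite3 from the standard library. Here `:memory:` plays the role of the 'in_memory' workspace, and the single `backup()` call plays the role of the final CopyFeatureclass/CopyRow write-out (table name and data are illustrative):

```python
import os
import sqlite3
import tempfile

# Create the "feature class" in RAM and do all the work there...
mem = sqlite3.connect(":memory:")
mem.execute("CREATE TABLE nodes (id INTEGER PRIMARY KEY, x REAL, y REAL)")

records = [(1, 0.0, 0.0), (2, 1.5, 2.5), (3, 3.0, 4.0)]  # illustrative rows
mem.executemany("INSERT INTO nodes VALUES (?, ?, ?)", records)
mem.commit()

# ...then persist to disk in one bulk copy instead of row-by-row disk writes.
path = os.path.join(tempfile.mkdtemp(), "out.db")
disk = sqlite3.connect(path)
mem.backup(disk)

count = disk.execute("SELECT COUNT(*) FROM nodes").fetchone()[0]
```

The point of the shape, in either toolset, is that the disk is touched exactly once at the end.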
Best Answer
Although this question was already answered, I thought I could chime in and give my two cents.
DISCLAIMER: I worked on the GeoDatabase team at ESRI for some years and was in charge of maintaining various parts of the GeoDatabase code (Versioning, Cursors, EditSessions, History, Relationship Classes, etc.).
I think the biggest source of performance problems with ESRI code is not understanding the implications of using different objects, particularly the "little" details of the various GeoDatabase abstractions! So very often the conversation turns to the language being used as the culprit of the performance issues. In some cases it can be, but not all the time. Let's start with the language discussion and work our way back.
1. The programming language that you pick only matters when you are doing something complicated, in a tight loop. Most of the time, this is not the case.
The big elephant in the room is that at the core of all ESRI code you have ArcObjects - and ArcObjects is written in C++ using COM. There is a cost for communicating with this code. This is true for C#, VB.NET, Python, or whatever else you are using.
You pay a price at initialization of that code. That may be a negligible cost if you do it only once.
You then pay a price for every subsequent time that you interact with ArcObjects.
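That per-interaction cost is why chatty code across the COM boundary hurts, and why batching crossings pays off. A toy sketch, with `time.sleep` standing in for the fixed per-call marshalling overhead (the functions are illustrative, not ArcObjects APIs):

```python
import time

PER_CALL_OVERHEAD = 0.0005  # stand-in for the fixed cost of one COM crossing

def boundary_call(x):
    # Chatty style: pay the crossing cost once per record.
    time.sleep(PER_CALL_OVERHEAD)
    return x * 2

def boundary_batch(xs):
    # Batched style: pay the crossing cost once for the whole set.
    time.sleep(PER_CALL_OVERHEAD)
    return [x * 2 for x in xs]

chatty = [boundary_call(i) for i in range(20)]   # 20 crossings
batched = boundary_batch(range(20))              # 1 crossing, same answer
```

Both produce identical results; only the number of times you pay the fixed overhead differs.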
Personally, I tend to write code for my clients in C#, because it is easy and fast enough. However, every time I want to move data around or do some processing on large amounts of data that is already implemented in Geoprocessing, I just initialize the scripting subsystem and pass in my parameters. Why?
Ah yes, so then the solution is to use a lot of geoprocessing functions. Actually, you have to be careful.
2. GP is a black box that copies data around (potentially unnecessarily)
It is a double-edged sword. It is a black box that does some magic internally and spits out results - but those results are very often duplicated. 100,000 rows can easily be converted into 1,000,000 rows on disk after you run your data through 9 different functions. Using only GP functions is like creating a linear GP model, and well...
3. Chaining too many GP functions for large datasets is highly inefficient. A GP Model is (potentially) equivalent to executing a query in a really really really dumb way
Now don't get me wrong. I love GP Models - they save me from writing code all the time. But I am also aware that they are not the most efficient way of processing large datasets.
Have you ever heard of a Query Planner? Its job is to look at the SQL statement you want to execute, generate an execution plan in the form of a directed graph that looks a heck of a lot like a GP Model, look at the statistics stored in the db, and choose the optimal order to execute them. GP just executes things in the order you put them because it doesn't have statistics to do anything more intelligent - you are the query planner. And guess what? The order in which you execute things is very dependent on your dataset, and it can make the difference between days and seconds. That is up to you to decide.
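The effect of ordering can be seen with a toy model of two "plans" for the same result: an expansion step (like a GP function that duplicates data) and a selective filter. Which runs first changes how much data the pipeline touches (the data and the 1-in-1000 selectivity are illustrative):

```python
records = list(range(100_000))

def filter_last(rows):
    # Expand first: materialises 100,000 intermediate pairs on the way.
    expanded = [(r, r) for r in rows]
    return [p for p in expanded if p[0] % 1000 == 0]

def filter_first(rows):
    # Filter first: only 100 rows survive, so the expansion touches 100 rows.
    selected = [r for r in rows if r % 1000 == 0]
    return [(r, r) for r in selected]

same_result = filter_first(records) == filter_last(records)
```

A real query planner picks the `filter_first` ordering from table statistics; in a GP model, you have to pick it yourself.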
"Great" you say, I will not script things myself and be careful about how I write stuff. But do you understand GeoDatabase abstractions?
4. Not understanding GeoDatabase abstractions can easily bite you
Instead of pointing out every single thing that can possibly give you a problem, let me just point out a few common mistakes that I see all the time and some recommendations.
5. And last but not least...
Understand the difference between I/O bound and CPU bound operations
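A quick way to tell the two apart is to compare wall-clock time against CPU time for the same operation: if wall-clock dwarfs CPU, the process is mostly waiting on I/O (batching and buffering will help); if they are close, it is CPU bound (a better algorithm will help). A minimal sketch using the standard library timers, with a sleep standing in for disk/database waits:

```python
import time

def profile(fn):
    """Return (wall_seconds, cpu_seconds) for one call to fn."""
    w0, c0 = time.perf_counter(), time.process_time()
    fn()
    return time.perf_counter() - w0, time.process_time() - c0

# A wait stands in for I/O; a tight loop stands in for real compute.
wall_io, cpu_io = profile(lambda: time.sleep(0.05))
wall_cpu, cpu_cpu = profile(lambda: sum(i * i for i in range(1_000_000)))
```

For the sleep, `wall_io` is about 0.05 s while `cpu_io` stays near zero; for the loop, the two track each other. Knowing which regime you are in tells you which of the optimisations above is even worth trying.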
I honestly thought about expanding more on every single one of those items and perhaps doing a series of blog entries that covers every single one of those topics, but my calendar's backlog list just slapped me in the face and started yelling at me.
My two cents.