[GIS] Loading time of complete Tiger 2013 dataset into PostGIS

performancepostgistiger

I'm loading the complete TIGER 2013 dataset into a PostGIS database and finding that the process is taking an extraordinarily long time. I was surprised by how long it took to load the TIGER 2012 dataset locally (over 60 hours), but the TIGER 2013 load is so long as to be almost unusable.

Loading a single state's worth of data (whether it be Alaska, Alabama, or Arkansas) is taking 12+ hours from my local machine on a high speed network into an Amazon RDS Postgres 9.3 instance with Postgis 2.1. Note that this doesn't include downloading the files from ftp2.census.gov, since I pre-downloaded all the files. This is all time taking to process the SHP files and insert them into PostgreSQL.

I don't seem to be throttled by local resources – my CPU is nearly idle, and my load average is hovering around 2. My network has plenty of spare bandwidth.

I can't imagine I'm throttled by network on the RDS instance, although it could be limited by CPU or memory (it's an m1.large RDS). There is literally nothing else running against the database though, so the load should be minimal.

This situation is further complicated by the fact that the default TIGER scripts for PostGIS don't lend themselves to parallelizability – they drop and create the tiger_staging schema, which makes it impossible to load states in parallel. I'm planning to modify the scripts to allow parallel load, but before I do that I'd like to know if there's a more fundamental problem.

Has anyone done a full soup-to-nuts load of TIGER 2013 data?

If so, how long did it take?

Are there any configuration parameters I should be looking at that might be throttling this process?

Best Answer

This situation is further complicated by the fact that the default TIGER scripts for Postgis don't lend themselves to parallelizability - they drop and create the tiger_staging schema, which makes it impossible to load states in parallel.

I've seen people having problems with loading data into RDS with osm2pgsql caused by latency. I've wanted to experiment with RDS, but I haven't had the time or a project that needed.

I would try, instead of using PostGIS loading scripts, using ogr2ogr. You could then try it single-threaded, and running multiple processes in parallel. This should tell you how it scales to multiple threads accessing RDS.

Related Question