[GIS] Alternative approach for loading osm2pgsql data

Tags: openstreetmap, osm2pgsql, postgresql

I am slowly going insane trying to successfully complete an OSM data import and I would like to find an alternative approach. What I'm attempting is frustratingly simple yet fails repeatedly.

Originally I attempted to import the latest planet .osm.bz2 file in slim mode, but this failed with a parsing error. Then I tried to import the .pbf in slim mode, and it was moving so slowly that it would have taken three months. Then I had the memory in my VM increased to 48GB and removed the -s slim mode argument, and the import failed when it ran out of memory. Finally, I attempted to import the .pbf in slim mode with a 44GB node cache; approximately halfway through the nodes, after working for several days, it failed with "Zlib compression failed (code -3, incorrect data check)".

There has to be a simpler, faster, and more reliable way to do this. Ideally I would like to simply download someone else's Postgres data file, attach it to my database, sit back and relax.

I need the data to be in the osm2pgsql schema but I really don't care how it gets there, as long as it's no more than a couple of months old. Does anyone know why this is so complicated, and how to make it simpler?

Best Answer

The osm2pgsql documentation suggests

  osm2pgsql -c -d gis --slim -C <cache size> --flat-nodes <flat nodes> planet-latest.osm.pbf

where

  • <cache size> is 24000 on machines with 32GiB or more of RAM, or about 75% of memory in MiB on machines with less
  • <flat nodes> is a location where a 24GiB file can be saved.

This works, and on a fast machine will take under a day.

Because you have a lot of RAM, I'd go with a 28000 cache. There is no point going as high as 35000, since that is already more than is needed to store every node.
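
Filled in with those numbers, the command would look something like this (the database name gis and the flat-nodes path are placeholders; adjust both for your setup):

  osm2pgsql -c -d gis --slim -C 28000 --flat-nodes /path/to/flat-nodes.bin planet-latest.osm.pbf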

You should

  • use PBF, not bzipped XML

    Given your previous problem, also download the .md5 file and verify the download with md5sum -c, e.g. md5sum -c planet-latest.osm.pbf.md5. This will catch corruption during download or while saving to disk (there is a full sketch after this list).

  • use slim mode

    As the osm2pgsql documentation notes, non-slim mode also uses a way cache that takes up about as much RAM as the node cache, so you cannot do a non-slim import of the planet with 48GB of RAM.

  • decide on an update strategy

    At some point, you will want to update your database. If you plan to do it infrequently, you'll probably just drop the database and reload the latest data. If this is your plan, add the --drop flag to your command line; the slim tables will then be dropped instead of indexed, saving over 100GB of disk space and hours of time. If you use --drop, you can also delete the flat-nodes file afterwards (see the sketch after this list).

    If you plan on updating using minutely, hourly, or daily diffs, then you can't use --drop.

  • specify the number of CPU cores to use with --number-processes

    This will help the "pending ways" section of the import. Recent (0.88.x) versions of osm2pgsql default to the number of hardware threads, but older versions default to 1.
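
Putting these pieces together, a one-shot reload (no diff updates, so --drop is fine) might look roughly like this; the planet.openstreetmap.org URLs, the core count, and the file paths are assumptions to substitute with your own:

  wget https://planet.openstreetmap.org/pbf/planet-latest.osm.pbf
  wget https://planet.openstreetmap.org/pbf/planet-latest.osm.pbf.md5
  md5sum -c planet-latest.osm.pbf.md5
  osm2pgsql -c -d gis --slim --drop -C 28000 --flat-nodes /path/to/flat-nodes.bin --number-processes 8 planet-latest.osm.pbf

If you plan to apply diffs instead, run the same import without --drop and keep the flat-nodes file around for updates.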

You've mentioned using a VM. Many VMs have extremely slow disks when doing random IO, which may limit your speed.
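
If you want to gauge the disk before committing to a multi-day import, a quick random-read benchmark is one way to check; this fio invocation is a sketch (the test file path and sizes are arbitrary) and should be run on the disk that will hold the database:

  fio --name=randread --rw=randread --bs=4k --size=1G --direct=1 --ioengine=libaio --runtime=60 --time_based --filename=/path/to/fio-testfile

Delete the test file afterwards.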
