[GIS] How to import OpenStreetMap data into GraphX

openstreetmap, osm2pgrouting, routing

How to import OpenStreetMap data into a Spark GraphX RDD?

I would like to assess the feasibility of doing drive-time analysis on our own Spark cluster.

I thought about using osm2pgsql or a similar project, but I'm not sure whether it preserves the data required for routing.

I looked at OSM2Routing, but it's not clear which config file is required, e.g. for a map of the whole U.S.

I'm new to both OSM and GraphX.

Update: GraphX needs two tables: a vertex table (node id, properties) and an edge table (SrcId, DstId, properties). See https://spark.apache.org/docs/latest/graphx-programming-guide.html#example-property-graph
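To make the two-table shape concrete, here is a minimal sketch in plain Python (no Spark involved) that turns OSM XML into a vertex table and an edge table. Everything in it is an assumption for illustration: the sample data is hand-written, the names `osm_to_tables` and `haversine_m` are hypothetical, only ways tagged `highway` are kept, only `oneway=yes` is treated as one-way, and every pair of consecutive way nodes becomes its own edge. Real routing needs more careful tag handling than this.

```python
import math
import xml.etree.ElementTree as ET

# Tiny hand-written OSM XML sample (hypothetical data, not a real extract).
SAMPLE_OSM = """<osm version="0.6">
  <node id="1" lat="40.0000" lon="-75.0000"/>
  <node id="2" lat="40.0010" lon="-75.0000"/>
  <node id="3" lat="40.0020" lon="-75.0000"/>
  <node id="4" lat="40.0020" lon="-75.0010"/>
  <node id="5" lat="40.0030" lon="-75.0010"/>
  <way id="100">
    <nd ref="1"/><nd ref="2"/><nd ref="3"/>
    <tag k="highway" v="residential"/>
  </way>
  <way id="101">
    <nd ref="3"/><nd ref="4"/>
    <tag k="highway" v="primary"/>
    <tag k="oneway" v="yes"/>
  </way>
  <way id="102">
    <nd ref="1"/><nd ref="5"/>
    <tag k="building" v="yes"/>
  </way>
</osm>"""


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters on a sphere of radius 6371 km."""
    r = 6371000.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))


def osm_to_tables(osm_xml):
    """Return (vertices, edges) in the shape GraphX expects.

    vertices: {node_id: (lat, lon)}
    edges:    [(src_id, dst_id, properties_dict), ...]
    """
    root = ET.fromstring(osm_xml)
    coords = {
        int(n.get("id")): (float(n.get("lat")), float(n.get("lon")))
        for n in root.iter("node")
    }
    vertices, edges = {}, []
    for way in root.iter("way"):
        tags = {t.get("k"): t.get("v") for t in way.findall("tag")}
        if "highway" not in tags:
            continue  # skip buildings, landuse, and other non-road ways
        refs = [int(nd.get("ref")) for nd in way.findall("nd")]
        oneway = tags.get("oneway") == "yes"  # simplification, see lead-in
        for src, dst in zip(refs, refs[1:]):
            if src not in coords or dst not in coords:
                continue  # a way may reference nodes missing from the input
            vertices[src] = coords[src]
            vertices[dst] = coords[dst]
            props = {
                "way_id": int(way.get("id")),
                "highway": tags["highway"],
                "length_m": haversine_m(*coords[src], *coords[dst]),
            }
            edges.append((src, dst, props))
            if not oneway:
                # GraphX edges are directed, so two-way streets need both directions.
                edges.append((dst, src, dict(props)))
    return vertices, edges


vertices, edges = osm_to_tables(SAMPLE_OSM)
for src, dst, props in edges:
    print(src, dst, props["highway"], round(props["length_m"], 1))
```

The resulting tuples could then be written out (for example as CSV) and loaded into Spark as the vertex and edge RDDs. For a whole-country extract, a streaming parser would probably be needed instead of reading the entire XML into memory as this sketch does.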

Best Answer

A couple of things come to mind here:

  • First and foremost, why do you feel the need to use an unproven HPC routing solution for something for which the OSM community has already developed proven and tested solutions? Routing on OSM data has been implemented in several open source projects (see http://wiki.openstreetmap.org/wiki/Routing). Even ESRI has a routing (or routing network generation) capability built into its open source ArcGIS Editor for OpenStreetMap. Each of these runs on ordinary hardware and ordinary configurations as far as I can tell, without requiring extensive clusters. Even the main OpenStreetMap website now features routing, which undercuts the need for something as complex as an HPC solution (and I am not even counting the 10-year-old commercial car navigation computers that could already do the job). It sounds to me a bit like trying to catch a mouse with a shotgun...
  • Extending on this: most OSM users don't have access to something as fancy as an HPC cluster. This essentially means you are probably "on your own" with this question, and you will need to dig deep and work hard to solve it if you really intend to do this as some kind of research project. Your current question makes me doubt whether you have actually done your "homework" yet... You can't start a project like this if you are "new"... Start reading everything you can about OSM and GraphX!
  • One thing that might provide a solution for getting OSM data into an HPC cluster (I am ignoring the in-memory versus disk-based distinction here, as I am not sure it is relevant) is the ESRI route, because I know ESRI has also been working on a solution to put GIS data in an HPC cluster and to extend Hadoop with OGC-based spatial functions. So one possible route might be something like:

    • 1) Import OSM data into a File Geodatabase, or an Oracle / SQL Server Enterprise Geodatabase, using the Load OSM File tool that is part of the ArcGIS Editor for OpenStreetMap. An *.osm file for the entire US can be downloaded from Geofabrik.

    • 1b) Optionally, use the Create OSM Network Dataset tool of this toolbox to create a routable network.

    • 2) Export the data to JSON in the Hadoop cluster using the ESRI Geoprocessing Tools for Hadoop toolset.
    • 3) Possibly make use of the ESRI Geometry API for Java, developed for use on Hadoop, to do fancier things with your Hadoop-stored data, like building a routable network.

I have never done any of this myself, so bear with me if it is somewhat speculative, but I know the tools are there...
