How to import OpenStreetMap data into a Spark GraphX RDD?
I'd like to assess the feasibility of doing drive-time analysis in our own Spark cluster.
I thought of using osm2pgsql or a similar project, but I'm not sure it preserves the data required for routing.
I also looked at OSM2Routing, but it's not clear what config file would be required, e.g. for a map of the whole U.S.
I'm new to both OSM and GraphX.
Update: GraphX needs two tables: a Vertex Table (node id, properties) and an Edge Table (SrcId, DstId, properties). https://spark.apache.org/docs/latest/graphx-programming-guide.html#example-property-graph
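To make that concrete, here is a minimal sketch of the two tables for a tiny road fragment, modeled as plain Scala collections (all ids, coordinates, and tag values are made up for illustration). In a real Spark job each `Seq` would be wrapped with `sc.parallelize` and passed to `Graph(vertices, edges)`; those Spark calls are shown in comments since they need a running cluster.

```scala
// Vertex table: (vertexId, properties) — here an OSM node id with (lat, lon).
val vertices: Seq[(Long, (Double, Double))] = Seq(
  (1L, (40.7128, -74.0060)),
  (2L, (40.7130, -74.0055)),
  (3L, (40.7135, -74.0050))
)

// Edge table: (srcId, dstId, properties) — here a road segment with the
// attributes routing needs: (highway type, length in meters).
// One OSM way passing through nodes 1 -> 2 -> 3 becomes two edges.
val edges: Seq[(Long, Long, (String, Double))] = Seq(
  (1L, 2L, ("residential", 55.0)),
  (2L, 3L, ("residential", 62.0))
)

// With Spark/GraphX on the classpath this becomes a property graph:
//   import org.apache.spark.graphx.{Edge, Graph}
//   val g = Graph(
//     sc.parallelize(vertices),
//     sc.parallelize(edges.map { case (s, d, p) => Edge(s, d, p) }))

// Sanity check: every edge endpoint exists in the vertex table.
val nodeIds = vertices.map(_._1).toSet
val consistent = edges.forall { case (s, d, _) => nodeIds(s) && nodeIds(d) }
```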
Best Answer
There are a couple of things that come to mind here:
One thing that possibly provides a solution for getting OSM data into an HPC cluster (I am ignoring the in-memory versus disk-based distinction here, as I am not really sure it is relevant) is the ESRI route: I know ESRI has also been working on a solution to put GIS data in an HPC cluster and to extend Hadoop with OGC-based spatial functions. So one possible workflow might be something like:
1) Import OSM data into a File Geodatabase, or an Oracle / SQL Server Enterprise Geodatabase, using the Load OSM File tool that is part of the ArcGIS Editor for OpenStreetMap. An *.osm file for the entire US can be downloaded from Geofabrik.
1b) Optionally, use the Create OSM Network Dataset tool of the same toolbox to create a routable network.
I have never done any of this myself, so bear with me if it is somewhat speculative, but I know the tools are there...
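Independent of the ESRI route, it may help to see how raw OSM XML maps onto the two tables GraphX wants: `<node>` elements become vertex rows, and consecutive `<nd ref="...">` pairs inside each `<way>` become edge rows. Below is a hedged sketch in Scala using only the JDK's DOM parser, on a tiny hand-written OSM fragment; a whole-US extract would instead need a streaming (SAX/StAX) parser or a PBF library, and real routing would also have to honor `oneway` tags and turn restrictions, which this sketch ignores.

```scala
import java.io.ByteArrayInputStream
import javax.xml.parsers.DocumentBuilderFactory
import org.w3c.dom.{Element, NodeList}

// Tiny hand-written OSM fragment: three nodes and one way through them.
val osm =
  """<osm>
    |  <node id="1" lat="40.71" lon="-74.00"/>
    |  <node id="2" lat="40.72" lon="-74.01"/>
    |  <node id="3" lat="40.73" lon="-74.02"/>
    |  <way id="10">
    |    <nd ref="1"/><nd ref="2"/><nd ref="3"/>
    |    <tag k="highway" v="residential"/>
    |  </way>
    |</osm>""".stripMargin

val doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
  .parse(new ByteArrayInputStream(osm.getBytes("UTF-8")))

// Helper: materialize a DOM NodeList as a Scala IndexedSeq of Elements.
def asElems(nl: NodeList): IndexedSeq[Element] =
  (0 until nl.getLength).map(nl.item(_).asInstanceOf[Element])

// Vertex table: (nodeId, (lat, lon)).
val vertices = asElems(doc.getElementsByTagName("node")).map { e =>
  (e.getAttribute("id").toLong,
   (e.getAttribute("lat").toDouble, e.getAttribute("lon").toDouble))
}

// Edge table: each consecutive pair of <nd> refs in a way becomes one edge,
// carrying the way's highway tag as the edge property.
val edges = asElems(doc.getElementsByTagName("way")).flatMap { w =>
  val refs = asElems(w.getElementsByTagName("nd")).map(_.getAttribute("ref").toLong)
  val highway = asElems(w.getElementsByTagName("tag"))
    .find(_.getAttribute("k") == "highway")
    .map(_.getAttribute("v"))
    .getOrElse("unknown")
  refs.sliding(2).collect { case Seq(a, b) => (a, b, highway) }
}
// edges now holds the segments 1 -> 2 and 2 -> 3.
```

From here, `sc.parallelize(vertices)` and `sc.parallelize(edges.map { case (s, d, p) => Edge(s, d, p) })` would feed straight into GraphX's `Graph(...)` constructor.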