[GIS] way to autocorrect misspelled cities for geocoding

geocoding

I could not geocode a subset of my dataset (around 140,000 observations), which contains full postal addresses from Germany, i.e. zip code, city and street. Apparently, the geocoder failed to convert these addresses because of mistakes in the data. From eyeballing the failures, the most common causes seem to be the following:

  • minor spelling mistakes in the city name. Example: "Neunburg v. Wald" instead of "Neunburg vorm Wald" or "Hessheim" instead of "Heßheim"
  • a missing part of the name. Example: "Pohlheim" instead of "Pohlheim-Watzenborn"
  • a wrong match between zip code and city name (maybe because the address was recorded before a change in the zip code occurred, e.g. two zip codes were merged)

This is why I would like to "auto-complete" / "auto-correct" the city names. I imagine doing the following: if I had access to a database that contains all zip codes and cities for Germany, I would build a list of "suggestions" for each failed address and choose the one that is "closest" to the wrong city name. The list of suggestions could contain all cities that share the same zip code (PLZ; this is not a one-to-one mapping), or those cities from the database that start with the same first x characters (under the assumption that the error does not occur in that range). Then one could pick the closest string (city name) using an approximate string matching measure such as the Levenshtein distance.
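A minimal sketch of that idea in Python, using only the standard library. The function names (`levenshtein`, `normalise`, `build_index`, `suggest_city`), the `prefix_len` parameter and the sample rows are made up for illustration. The zip codes in the sample rows are placeholders, and "Schwarzhofen" is only there to give the first zip code a second candidate. Candidates are taken from the same PLZ first; the prefix match is used only when the PLZ is unknown.

```python
from collections import defaultdict


def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def normalise(name: str) -> str:
    """Casefold (this maps 'ß' to 'ss') and collapse whitespace."""
    return " ".join(name.casefold().split())


def build_index(rows):
    """Map each PLZ to the set of city names recorded for it."""
    by_plz = defaultdict(set)
    for plz, city in rows:
        by_plz[plz].add(city)
    return by_plz


def suggest_city(plz, city, by_plz, prefix_len=3):
    """Return the reference city closest to `city`, or None if no candidate."""
    target = normalise(city)
    # First choice: every city that shares the zip code.
    candidates = set(by_plz.get(plz, ()))
    if not candidates:
        # Fallback: cities starting with the same first characters.
        prefix = target[:prefix_len]
        candidates = {c for cities in by_plz.values() for c in cities
                      if normalise(c).startswith(prefix)}
    if not candidates:
        return None
    # Sorting first makes ties resolve deterministically.
    return min(sorted(candidates),
               key=lambda c: levenshtein(target, normalise(c)))


# Illustrative reference rows only; zip codes are placeholders.
reference_rows = [
    ("92431", "Neunburg vorm Wald"),
    ("92431", "Schwarzhofen"),
    ("67258", "Heßheim"),
    ("35415", "Pohlheim-Watzenborn"),
]
index = build_index(reference_rows)
print(suggest_city("92431", "Neunburg v. Wald", index))
print(suggest_city("35415", "Pohlheim", index))
```

For the third error class (zip code and city do not belong together) this sketch would still return the nearest city under the recorded PLZ, so in practice one would probably also reject suggestions whose distance exceeds some cutoff and retry with the prefix match.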

This leads me to two subquestions:

  • Is there a possibility to extract this information (zip codes and city names) from the OSM database (I do not have an installed instance though)?
  • Should I rather rely on service providers like Google?

Best Answer

For OSM-based geocoding, have a closer look at the wiki article Search Engines, from which I recommend Pelias or Photon ... maybe both have an API. Maybe their fuzzy search can help you?

As for extracting data from the main OSM database: we have complete and up-to-date definitions of administrative and postal code boundaries for the whole of Germany.

Give us an example of the data you want to extract in one run or via a batch command, and we can try to put together a query that you can run via overpass-api or overpass-turbo.eu.
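As a starting point, here is a rough Python sketch of what such an extraction could look like. It rests on several assumptions: that postal code areas are mapped as relations tagged `boundary=postal_code` with a `postal_code` tag, that the place name can be read from a `note` tag of the form "PLZ Name" (a mapping convention that may not hold for every relation), and that the public endpoint at overpass-api.de is used. The function names are made up for illustration, and the timeout values are guesses.

```python
import json
import urllib.parse
import urllib.request

# Assumed public endpoint; other Overpass instances exist.
OVERPASS_URL = "https://overpass-api.de/api/interpreter"

# Overpass QL: all postal code relations inside Germany, tags only.
QUERY = """
[out:json][timeout:180];
area["ISO3166-1"="DE"][admin_level=2]->.de;
relation["boundary"="postal_code"]["postal_code"](area.de);
out tags;
"""


def fetch_postal_relations(endpoint: str = OVERPASS_URL, query: str = QUERY) -> dict:
    """POST the query to an Overpass instance and return the decoded JSON."""
    body = urllib.parse.urlencode({"data": query}).encode("utf-8")
    with urllib.request.urlopen(endpoint, data=body, timeout=300) as response:
        return json.load(response)


def parse_postal_relations(payload: dict) -> list[tuple[str, str]]:
    """Turn an Overpass JSON payload into (plz, label) pairs.

    Assumes the label sits in the `note` tag, optionally prefixed by the PLZ.
    Relations without a `postal_code` tag are skipped.
    """
    pairs = []
    for element in payload.get("elements", []):
        tags = element.get("tags", {})
        plz = tags.get("postal_code")
        if not plz:
            continue
        note = tags.get("note", "")
        label = note[len(plz):] if note.startswith(plz) else note
        pairs.append((plz, label.strip()))
    return pairs
```

Calling `parse_postal_relations(fetch_postal_relations())` would then give a list of pairs to match against. The query covers all of Germany, so expect it to take a while; the same query text can be pasted into overpass-turbo.eu to check the result first.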
