[GIS] Reconciling UTF encoding for OpenStreetMap

encoding moldova openstreetmap

Frankly, I get a bit confused when working with data from outside the United States, so please excuse my ignorance.

I downloaded some shapefiles for Moldova (MoldovaData) and have a question regarding the encoding of the attributes.

I have other data I want to link to it, but many of the city names are encoded
differently. For instance, I believe the attribute Bălți corresponds to
Balti (Bălți). How do I reconcile the attribute names?

Best Answer

Shapefiles get their codepage either from the .dbf or from the .cpg file.

The .dbf file has a byte in its header that holds the DBF Language Driver ID (LDID). There's some discussion about these IDs in an archived ArcGIS Desktop forum on forums.esri.com. There's also a Microsoft Knowledge Base article, "Understanding Code Pages in Visual FoxPro", which lists 19 DBF Language Driver IDs and their corresponding codepages.
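To check what a given .dbf actually declares, the byte can be read directly. Below is a minimal Python sketch, assuming the standard dBASE header layout in which the Language Driver ID sits at byte offset 29 of the 32-byte file header. The function name read_ldid and the small lookup table are my own illustration, and the table is only a subset of the IDs listed in the FoxPro article.

```python
def read_ldid(dbf_path):
    """Return the Language Driver ID byte stored in a .dbf header.

    Assumes the usual dBASE layout: a 32-byte file header with the
    LDID at byte offset 29 (counting from 0).
    """
    with open(dbf_path, "rb") as f:
        header = f.read(32)
    if len(header) < 32:
        raise ValueError("file is too short to contain a .dbf header")
    return header[29]


# Illustrative subset only: a few well-known LDID values and the
# Python codec names for the codepages they stand for.
LDID_TO_CODEC = {
    0x01: "cp437",   # U.S. MS-DOS
    0x02: "cp850",   # International MS-DOS
    0x03: "cp1252",  # Windows ANSI
    0xC8: "cp1250",  # Eastern European Windows
    0xC9: "cp1251",  # Russian Windows
}
```

For example, LDID_TO_CODEC.get(read_ldid("moldova_location.dbf")) returns a codec name if the ID is in the table, or None if the ID is 0 (not set) or simply not listed here.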

The ArcGIS Resource Center page for Shapefile file extensions states that the .cpg is an optional file that can be used to specify the codepage, which identifies the character set to be used.

In ArcGIS, if a .cpg file is present, it takes precedence over the DBF Language Driver ID in the .dbf file. The .cpg route is generally preferred because the DBF Language Driver ID covers only the codepages supported during the dBASE IV era, whereas the .cpg file can name any codepage.
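The precedence rule can be sketched in a few lines of Python. This is not ArcGIS's implementation, only an illustration of the order described above; the function name shapefile_encoding and its (source, value) return shape are hypothetical, and it assumes the sidecar files use lowercase extensions and the LDID is at header offset 29.

```python
from pathlib import Path


def shapefile_encoding(shp_path):
    """Report where a shapefile's codepage would come from.

    Returns ("cpg", label) if a non-empty .cpg exists, otherwise
    ("ldid", byte_value) if the .dbf header carries a non-zero
    Language Driver ID, otherwise ("none", None).
    """
    shp = Path(shp_path)

    # 1. A .cpg file, when present, wins.
    cpg = shp.with_suffix(".cpg")
    if cpg.is_file():
        # strip() removes the trailing space and line ending that
        # a command such as "ECHO 65001 > file.cpg" leaves behind.
        label = cpg.read_text(encoding="ascii", errors="ignore").strip()
        if label:
            return ("cpg", label)

    # 2. Otherwise fall back to the LDID byte in the .dbf header.
    dbf = shp.with_suffix(".dbf")
    if dbf.is_file():
        with dbf.open("rb") as f:
            header = f.read(32)
        if len(header) == 32 and header[29] != 0:
            return ("ldid", header[29])

    # 3. Nothing declared; the reading software must guess.
    return ("none", None)
```

Running it against a shapefile before and after adding the .cpg is a quick way to confirm that the sidecar file is the one being picked up.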

The Moldova shapefiles use UTF-8 encoding, and UTF-8 can be specified only through the .cpg file. Therefore you will need to create a .cpg text file for each shapefile and place either 65001 or UTF-8 in its body. For your convenience I've included the following MAKECPG.BAT batch file, which you can save and run in the folder containing the shapefiles to create the .cpg files:

REM MAKECPG.BAT
REM Writes codepage 65001 (UTF-8) into a .cpg file beside each Moldova shapefile.
ECHO 65001 > moldova_administrative.cpg
ECHO 65001 > moldova_coastline.cpg
ECHO 65001 > moldova_highway.cpg
ECHO 65001 > moldova_location.cpg
ECHO 65001 > moldova_natural.cpg
ECHO 65001 > moldova_poi.cpg
ECHO 65001 > moldova_water.cpg
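
Once the attributes decode as UTF-8, a name such as Bălți still won't equal a plain Balti in your other data, so the join needs a common key. Here is a minimal Python sketch of one way to build it, by stripping diacritics and case before comparing. The helper names fold_diacritics and match_key are hypothetical, and the cp1252 step is only an assumed example of what a reader does when it ignores the UTF-8 declaration.

```python
import unicodedata


def fold_diacritics(name):
    """Strip combining marks so that 'Bălți' becomes 'Balti'."""
    # NFKD splits each accented letter into base letter + combining mark.
    decomposed = unicodedata.normalize("NFKD", name)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def match_key(name):
    """Case- and accent-insensitive key for joining on place names."""
    return fold_diacritics(name).casefold().strip()


# What goes wrong without the .cpg: UTF-8 bytes read as cp1252
# (an assumed default) turn into mojibake...
garbled = "Bălți".encode("utf-8").decode("cp1252")

# ...which is reversible as long as no bytes were lost on the way.
repaired = garbled.encode("cp1252").decode("utf-8")

print(fold_diacritics(repaired))                  # Balti
print(match_key("Bălți") == match_key("Balti"))   # True
```

Folding is lossy, so it is safer to keep the original attribute and store the folded key in a separate column used only for the join.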