[GIS] How to create/write a Shapefile with UTF-8 encoded attributes with ogr2ogr

dbf ogr2ogr shapefile utf-8

I'm using GDAL/OGR's ogr2ogr command line tool to export data from a PostGIS-enabled PostgreSQL database to various GIS file formats, Shapefile among them. To create a shapefile with the default encoding (ISO-8859-1, a.k.a. Latin-1; see "Which character encoding is used by the DBF file in shapefiles?"), I'm using a command line like this:

ogr2ogr \
    -f 'Esri Shapefile' \
    "$OUTPUT_PATH" \
    -t_srs "$OUT_SRS" \
    "PG:dbname=${DB_NAME} user=${DB_USERNAME} password=${DB_PASSWORD} schemas=${SCHEMA_NAME}"

But in my data, there may be features with arbitrary languages and scripts in the attribute values. I'd like to preserve these values and thus think that the *.dbf file of the exported Shapefile should be UTF-8 encoded. (The attribute names are guaranteed to be within 7-bit ASCII.)

How can I get ogr2ogr to write a UTF-8 encoded *.dbf file when exporting to Shapefile? The documentation of the GDAL/OGR "ESRI Shapefile / DBF" driver is explicitly ambiguous (sic!) about the ENCODING option in the "Layer Creation Options":

The default value is "LDID/87". It is not clear what other values may be appropriate.

And will the encoding used be indicated in the *.dbf file itself or in an accompanying *.cpg file? (I guess the latter, as UTF-8 probably isn't per se a valid (dBASE DBMS) DBF encoding.) If the latter, will ogr2ogr create the *.cpg file or do I have to create it manually?

Best Answer

How can I get ogr2ogr to write a UTF-8 encoded *.dbf file when exporting to Shapefile?

Similar to "How to encode shapefiles from LATIN1 to UTF-8?", this is possible with -lco ENCODING=UTF-8. So, in my case:

ogr2ogr \
    -f 'Esri Shapefile' \
    "$OUTPUT_PATH" \
    -t_srs "$OUT_SRS" \
    "PG:dbname=${DB_NAME} user=${DB_USERNAME} password=${DB_PASSWORD} schemas=${SCHEMA_NAME}" \
    -lco ENCODING=UTF-8

And will the encoding used be indicated in the *.dbf file itself or in an accompanying *.cpg file? (I guess the latter, as UTF-8 probably isn't per se a valid (dBASE DBMS) DBF encoding.)

The latter, indeed: the encoding is recorded in *.cpg files (one per exported database table), each containing only

UTF-8
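To see what the sidecar files actually declare, something like the following may help. It is only a sketch: the function name list_cpg_encodings is made up for illustration, and it assumes the output directory holds lowercase *.cpg files next to the *.dbf files.

```shell
# Hypothetical helper (not part of GDAL/OGR): print the encoding declared
# by each *.cpg sidecar file in the given directory.
list_cpg_encodings() {
    dir="$1"
    for cpg in "$dir"/*.cpg; do
        # Skip the unexpanded glob pattern when no *.cpg file exists.
        [ -e "$cpg" ] || continue
        # Strip CR/LF so the declared value prints on a single line.
        encoding=$(tr -d '\r\n' < "$cpg")
        printf '%s: %s\n' "$(basename "$cpg")" "$encoding"
    done
}
```

Called as list_cpg_encodings "$OUTPUT_PATH", it should print one line per layer, for example "roads.cpg: UTF-8" for a layer named roads.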

If the latter, will ogr2ogr create the *.cpg file or do I have to create it manually?

ogr2ogr will create the *.cpg files for you.
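If you want to double-check this after an export, a small shell function along these lines could list any *.dbf file that has no matching *.cpg next to it. This is a sketch under assumptions: find_dbf_without_cpg is a hypothetical name, and the file extensions are assumed to be lowercase.

```shell
# Hypothetical helper (not part of GDAL/OGR): print the name of every *.dbf
# file in the given directory that has no *.cpg file beside it.
find_dbf_without_cpg() {
    dir="$1"
    for dbf in "$dir"/*.dbf; do
        # Skip the unexpanded glob pattern when no *.dbf file exists.
        [ -e "$dbf" ] || continue
        # Derive the expected sidecar path by swapping the extension.
        cpg="${dbf%.dbf}.cpg"
        [ -e "$cpg" ] || printf '%s\n' "$(basename "$dbf")"
    done
}
```

Running find_dbf_without_cpg "$OUTPUT_PATH" after the export should print nothing if every layer got its *.cpg file.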