Python Shapefile – How to Find and Replace Unicode Characters in a Shapefile

fiona python shapefile unicode

I've found several example scripts and blog posts on working with Unicode characters, but I haven't been able to make anything work so far. It's a little frustrating.

I have a shapefile encoded in UTF-8 (exported from QGIS), and there are many É, È, À, Ô, etc. in the values of some fields.

For a geocoder, I need to normalize my field values.

This is what I have so far:

# -*- coding: utf-8 -*-
###############################################################################
import sys
from sys import argv
import osgeo.ogr
from amtpy import EndOfScript
###############################################################################

script, src_shp, fld_shp = argv

#- Opening the shapefile
shapefile = osgeo.ogr.Open(src_shp)
layer = shapefile.GetLayer(0)
spatialRef = layer.GetSpatialRef()


#- Going through each feature, one by one
for i in range(layer.GetFeatureCount()):
    print "Normalisation de la ligne %i" %(i+1)
    feat = layer.GetFeature(i)

    texte_norm = feat.GetField(fld_shp)
    texte_norm = texte_norm.encode('utf-8')

    texte_norm = texte_norm.upper()
    texte_norm = texte_norm.replace(u'\u00C0', 'A') #À
    texte_norm = texte_norm.replace(u'\u00C9', 'E') #É
    texte_norm = texte_norm.replace(u'\u00C8', 'E') #È
    # I've removed 16-17 other characters to replace, for the example...
    texte_norm = texte_norm.decode('utf-8')

    print texte_norm
    #feat.SetField(fld_shp, texte_norm)

#- print EndOfScript message
print EndOfScript()

I've put a print instead of the SetField just so I could see if the replace works.

When the script encounters the first Unicode character, I get this error message:
'ascii' codec can't decode byte 0xc3 in position 5: ordinal not in range(128).

I know why it stops there: the character is an "À". I just can't seem to find the right way to handle it.

Any tips?

Thanks!

Best Answer

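GetField returns UTF-8 encoded byte strings, and you'll want to decode them before you process them in any way. Then you encode the result to pass it to SetField. You've got it backwards: calling encode on a byte string, or passing a unicode argument to replace on one, makes Python 2 first decode the bytes implicitly with the ASCII codec, and that implicit step is what raises the error.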
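The decode, process, encode order can be sketched as a small function. This is only an illustration: `normalize_field` and the `REPLACEMENTS` table are hypothetical names, the table is shortened to three characters, and the input is assumed to be the UTF-8 byte string that GetField hands back in the Python 2 bindings used in the question.

```python
# -*- coding: utf-8 -*-
# Hypothetical helper; REPLACEMENTS is a shortened, illustrative table.
REPLACEMENTS = {
    u'\u00C0': u'A',  # À
    u'\u00C9': u'E',  # É
    u'\u00C8': u'E',  # È
}


def normalize_field(raw_bytes):
    """Return an upper-cased UTF-8 byte string with the listed accents removed."""
    # 1. Decode first: UTF-8 bytes -> unicode text.
    text = raw_bytes.decode('utf-8')
    # 2. Process as text, so upper() also upper-cases accented letters.
    text = text.upper()
    for accented, plain in REPLACEMENTS.items():
        text = text.replace(accented, plain)
    # 3. Encode last: unicode text -> UTF-8 bytes for writing back.
    return text.encode('utf-8')
```

Inside the loop from the question, the result of `normalize_field(feat.GetField(fld_shp))` would be what gets passed to SetField. Doing the upper-casing on decoded text matters too: on a Python 2 byte string, upper() leaves the bytes of "é" untouched, so a later replace of "É" would never match.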

Fiona (shameless plug) deals in Python unicode strings and so is simpler to use.

Unidecode (https://pypi.python.org/pypi/Unidecode) is handy for stuff like this because it will make sensible transliterations and romanizations for many languages. It looks like it would make the ones you want.

>>> from unidecode import unidecode
>>> unidecode(u'\u00c2')
'A'
>>> unidecode(u'\u00C9')
'E'
>>> unidecode(u'\u00C8')
'E'
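If installing Unidecode is not possible, a rougher approximation can be built from the standard library alone. The sketch below is not what Unidecode does internally; it is a different, simpler technique (NFKD decomposition followed by dropping combining marks), and `strip_accents` is a hypothetical name.

```python
import unicodedata


def strip_accents(text):
    """Decompose text with NFKD, then drop the combining marks."""
    decomposed = unicodedata.normalize('NFKD', text)
    # After decomposition, an accented letter is a base letter followed by
    # one or more combining characters; keep only the non-combining ones.
    return u''.join(ch for ch in decomposed if not unicodedata.combining(ch))
```

This covers letters such as É, È, À and Ô, but characters that have no decomposition (for example "ø" or "ß") pass through unchanged, which is one reason Unidecode's transliteration tables are the more complete choice.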

The example below uses Natural Earth data and converts "Côte d'Ivoire" to "Cote d'Ivoire", etc., without presuming anything about the characters in the source data.

import fiona
from unidecode import unidecode

with fiona.open(
        '/Users/seang/data/ne_50m_admin_0_countries/'
        'ne_50m_admin_0_countries.shp', 'r') as source:

    # Create an output shapefile with the same schema,
    # coordinate systems. ISO-8859-1 encoding.
    with fiona.open(
            '/tmp/transliterated.shp', 'w',
            **source.meta) as sink:

        # Identify all the str type properties.
        str_prop_keys = [
            k for k, v in sink.schema['properties'].items()
                if v.startswith('str')]

        for rec in source:

            # Transliterate and update each of the str properties.
            for key in str_prop_keys:
                val = rec['properties'][key]
                if val:
                    rec['properties'][key] = unidecode(val)

            # Write out the transformed record.
            sink.write(rec)
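The per-record step in the loop above can also be pulled into a function and tried without any shapefile. This is a minimal sketch that assumes records are plain dict-like mappings with a 'properties' key, as in the example; `transliterate_record` is a hypothetical helper, and the transliteration function (`unidecode` in the example) is passed in as an argument.

```python
import copy


def transliterate_record(rec, str_keys, transliterate):
    """Return a copy of rec with the given string properties transformed."""
    out = copy.deepcopy(rec)  # leave the source record untouched
    for key in str_keys:
        val = out['properties'].get(key)
        if val:  # skip None and empty strings, as the loop above does
            out['properties'][key] = transliterate(val)
    return out
```

With this in place, the body of the loop would reduce to something like `sink.write(transliterate_record(rec, str_prop_keys, unidecode))`. Returning a copy instead of mutating the record is a design choice here, not something Fiona requires.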