Python Shapefile – How to Find and Replace Unicode Character in a Shapefile

fiona python shapefile unicode

I've found several example scripts and blog posts on working with Unicode characters, but I haven't been able to make anything work so far. It's a little frustrating.

I have a shapefile encoded in UTF-8 (exported from QGIS), and there are many É, È, À, Ô, etc. in the values of some fields.

For a geocoder, I need to normalize my field values.

This is what I have so far:

# -*- coding: utf-8 -*-
###############################################################################
import sys
from sys import argv
import osgeo.ogr
from amtpy import EndOfScript
###############################################################################

script, src_shp, fld_shp = argv

#- Opening the shapefile
shapefile = osgeo.ogr.Open(src_shp)
layer = shapefile.GetLayer(0)
spatialRef = layer.GetSpatialRef()


#- Going through each feature, one by one
for i in range(layer.GetFeatureCount()):
    print "Normalisation de la ligne %i" %(i+1)
    feat = layer.GetFeature(i)

    texte_norm = feat.GetField(fld_shp)
    texte_norm = texte_norm.encode('utf-8')

    texte_norm = texte_norm.upper()
    texte_norm = texte_norm.replace(u'\u00C0', 'A') #À
    texte_norm = texte_norm.replace(u'\u00C9', 'E') #É
    texte_norm = texte_norm.replace(u'\u00C8', 'E') #È
    # I've removed 16-17 more replacements to keep the example short...
    texte_norm = texte_norm.decode('utf-8')

    print texte_norm
    #feat.SetField(fld_shp, texte_norm)

#- print EndOfScript message
print EndOfScript()

I've used print instead of SetField just to see whether the replace works.

When the script encounters the first Unicode character, I get this error message:
'ascii' codec can't decode byte 0xc3 in position 5: ordinal not in range(128).

I understand why, since the character is an "À", but I can't seem to find the right way to handle it.

Any tips?

Thanks!

Best Answer

GetField returns UTF-8 encoded strings, and you'll want to decode them before you process them in any way. Then you encode the result to pass it to SetField. You've got it backwards.
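
A minimal sketch of that order of operations, assuming the field value arrives as UTF-8 bytes (as described above for the Python 2 OGR bindings). The helper name normalize_field and the REPLACEMENTS table are hypothetical, and only three of the question's replacements are shown:

```python
# -*- coding: utf-8 -*-

# Hypothetical lookup table; extend it with the other characters you need.
REPLACEMENTS = {
    u'\u00C0': u'A',  # À
    u'\u00C9': u'E',  # É
    u'\u00C8': u'E',  # È
}


def normalize_field(raw):
    """Decode UTF-8 bytes, normalize as text, then re-encode as UTF-8 bytes."""
    # 1. Decode first, so every later step works on Unicode text.
    text = raw.decode('utf-8') if isinstance(raw, bytes) else raw
    # 2. Process: uppercase, then replace the accented characters.
    text = text.upper()
    for accented, plain in REPLACEMENTS.items():
        text = text.replace(accented, plain)
    # 3. Encode last, just before handing the value back.
    return text.encode('utf-8')
```

In the question's loop, the return value is what would be passed to feat.SetField(fld_shp, ...) in place of the commented-out call.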

Fiona (shameless plug) deals in Python unicode strings and so is simpler to use.

Unidecode (https://pypi.python.org/pypi/Unidecode) is handy for stuff like this because it will make sensible transliterations and romanizations for many languages. It looks like it would make the ones you want.

>>> from unidecode import unidecode
>>> unidecode(u'\u00c2')
'A'
>>> unidecode(u'\u00C9')
'E'
>>> unidecode(u'\u00C8')
'E'

The example below uses Natural Earth data and converts "Côte d'Ivoire" to "Cote d'Ivoire", etc, without presuming anything about the characters in the source data.

import fiona
from unidecode import unidecode

with fiona.open(
        '/Users/seang/data/ne_50m_admin_0_countries/'
        'ne_50m_admin_0_countries.shp', 'r') as source:

    # Create an output shapefile with the same schema,
    # coordinate systems. ISO-8859-1 encoding.
    with fiona.open(
            '/tmp/transliterated.shp', 'w',
            **source.meta) as sink:

        # Identify all the str type properties.
        str_prop_keys = [
            k for k, v in sink.schema['properties'].items()
                if v.startswith('str')]

        for rec in source:

            # Transliterate and update each of the str properties.
            for key in str_prop_keys:
                val = rec['properties'][key]
                if val:
                    rec['properties'][key] = unidecode(val)

            # Write out the transformed record.
            sink.write(rec)
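
If installing Unidecode is not an option, a rough standard-library substitute is to decompose each character and drop the combining marks. This is a different technique from Unidecode, not what the example above does, and the function name strip_accents is hypothetical:

```python
import unicodedata


def strip_accents(text):
    """Remove diacritics by decomposing characters and dropping combining marks."""
    # NFKD splits e.g. U+00F4 (o with circumflex) into "o" plus a combining mark.
    decomposed = unicodedata.normalize('NFKD', text)
    # Keep only the characters that are not combining marks.
    return u''.join(ch for ch in decomposed if not unicodedata.combining(ch))
```

This handles the accented Latin letters from the question, but it leaves characters that have no decomposition (such as ø) untouched, which Unidecode would transliterate.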

Related Solutions

[GIS] select column in csv file in Python

For this CSV file, simple-csv.csv:

34.79038,-96.80871,"4/13/1983"
34.93032,-96.44490,"2/5/1967"
34.95507,-96.92268,"12/23/2001"
34.95689,-96.92263,"8/9/1999"
34.92559,-96.68021,"8/25/1954"

This code will open it up and print it out:

>>> # import csv module
>>> import csv
>>> # open and read the csv file into memory
>>> file = open('C:/testing/simple-csv.csv')
>>> reader = csv.reader(file)
>>> # iterate through the lines and print them to stdout
>>> # the csv module returns us a list of lists and we
>>> # simply iterate through it
>>> for line in reader:
...     print line
...
['34.79038', '-96.80871', '4/13/1983']
['34.93032', '-96.44490', '2/5/1967']
['34.95507', '-96.92268', '12/23/2001']
['34.95689', '-96.92263', '8/9/1999']
['34.92559', '-96.68021', '8/25/1954']

If you wanted to only get the first and second columns, do something like:

for line in reader:
    print line[0], line[1]
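
The same column selection can be sketched as a reusable function. This version is self-contained: it reads two of the sample rows from an in-memory string rather than from C:/testing/simple-csv.csv, and the names select_columns and SAMPLE are hypothetical:

```python
import csv
import io

# Two rows from the sample file above, held in memory for the example.
SAMPLE = u'34.79038,-96.80871,"4/13/1983"\n34.93032,-96.44490,"2/5/1967"\n'


def select_columns(lines, indexes):
    """Return only the requested column indexes from each CSV row."""
    reader = csv.reader(lines)
    return [[row[i] for i in indexes] for row in reader]


# Keep the first and second columns (the two coordinates).
coords = select_columns(io.StringIO(SAMPLE), [0, 1])
```

Passing an open file object instead of io.StringIO(SAMPLE) should work the same way, since csv.reader accepts any iterable of lines.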

QGIS – How to Avoid UnicodeEncodeError When Using Geoprocessing Tools

The problem is in the Python code. The following line of the error message highlights the problem:

"C:/Users/Gidi/.qgis//python/plugins\layers_by_field\layers_by_field_dialog.py", line 146, in split
    self.vlayer = QgsVectorLayer(vProvider.dataSourceUri(), str(layer.name()) + "_" + str(uValues[j]), "ogr")
UnicodeEncodeError: 'ascii' codec can't encode character u'\xfc' in position 7: ordinal not in range(128)
Python version: 2.7.2 (default, Jun 12 2011, 15:08:59) [MSC v.1500 32 bit (Intel)]
QGIS

The built-in str() is used there to build a string. Instead, unicode() should be used to produce a Unicode string. Note: this applies only to QGIS 2. QGIS 3 uses Python 3, where str is already Unicode by default, so the problem no longer occurs there.

self.vlayer = QgsVectorLayer( vProvider.dataSourceUri(), unicode(layer.name()) + "_" + unicode(uValues[j]), u'ogr')
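
In Python 2, calling str() on a Unicode string implicitly encodes it with the ASCII codec, which is where the error comes from. A small sketch that reproduces the same failure with an explicit encode, so it behaves the same on Python 2 and 3 (the function name and sample values are hypothetical, and no QGIS classes are involved):

```python
def ascii_encode_error(text):
    """Return the UnicodeEncodeError raised by ASCII-encoding text, or None."""
    try:
        text.encode('ascii')
    except UnicodeEncodeError as exc:
        return exc
    return None


layer_name = u'layer'
value = u'M\xfcnchen'  # contains u'\xfc', the character from the traceback

# Concatenating Unicode strings directly never triggers an ASCII encode.
label = layer_name + u'_' + value

# Encoding the same value to ASCII fails, as str() did in the plugin.
error = ascii_encode_error(value)
```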

If it is not your own plugin, please file a bug or contact the author.

Edit: I just checked. Here is the link to the bugtracker for this plugin.
