[GIS] How to preserve non-ASCII characters in ogr2ogr import

importogr2ogrsql server

I want to import a list of international place names and their locations into a SQL Server database using ogr2ogr.

The source data is a CSV file encoded in UTF-8.

I've prepared a five-record sample file called london_points.csv. It contains the records for London in Arabic, Czech, English. Russian, and Chinese. It looks like this:

LanguageCode,Name,Longitude,Latitude
"ar","لندن","-0.143555","51.577222"
"cz","Londýn","-0.143555","51.577222"
"en","London","-0.143555","51.577222"
"ru","Лондон","-0.143555","51.577222"
"zw","倫敦","-0.143555","51.577222"

To import the CSV file using ogr2ogr I created a VRT file called london_points.vrt. It tells ogr2ogr to import the data and derive a point column from the Longitude and Latitude columns. It looks like this:

<OGRVRTDataSource>
    <OGRVRTLayer name="london_points">
        <SrcDataSource>london_points.csv</SrcDataSource>
        <GeometryType>wkbPoint</GeometryType>
        <LayerSRS>WGS84</LayerSRS>
        <GeometryField encoding="PointFromColumns" x="Longitude" y="Latitude"/>
    </OGRVRTLayer>
</OGRVRTDataSource>

ogr2ogr apparently successfully imports all five records into tempdb:

$ ogr2ogr --DEBUG ON -s_srs EPSG:4326 -lco GEOM_TYPE=geography -overwrite -f MSSQLSpatial "MSSQL:server=.;database=tempdb;trusted_connection=yes" london_points.vrt
OGR: OGROpen(london_points.vrt/014AFF40) succeeded as VRT.
OGR_MSSQLSpatial: EstablishSession(Connection:"server=.;database=tempdb;trusted_connection=yes")
ODBC: SQLDriverConnect(DRIVER=SQL Server;server=.;database=tempdb;trusted_connection=yes)
OGR_MSSQLSpatial: Using column ogr_fid as FID for table london_points.
OGR: OGROpen(MSSQL:server=.;database=tempdb;trusted_connection=yes/0338CFF0) succeeded as MSSQLSpatial.
OGR: OGROpen(london_points.csv/033A41C8) succeeded as CSV.
MSSQLSpatial: DeleteLayer(london_points)
OGR_MSSQLSpatial: Using column ogr_fid as FID for table london_points.
OGR2OGR: 5 features written in layer 'london_points'
ODBC: SQLDisconnect()
VRT: 5 features read on layer 'london_points'.
CSV: 5 features read on layer 'london_points'.

I use a query like this to inspect the results in the database:

SELECT languagecode, name, longitude, latitude
FROM tempdb.dbo.london_points;

The result set looks like this:

languagecode name         longitude latitude
------------ ------------ --------- ---------
ar           لندن     -0.143555 51.577222
cz           Londýn      -0.143555 51.577222
en           London       -0.143555 51.577222
ru           Лондон -0.143555 51.577222
zw           倫敦       -0.143555 51.577222

The result set contains the correct number of rows. The languagecode, longitude, and latitude columns, which contain only ASCII characters, contain the correct values.

But all the non-ASCII characters in the name column are mangled!

The main ogr2ogr documentation mentions no switch that controls character encoding.

The CSV driver documentation states only that "All CSV files are treated as UTF-8 encoded."

Is this a limitation of ogr2ogr, or am I missing some magic switch somewhere?

Is there a problem with the GDAL SQL Server driver configuration?

Andre Joost asked: "Maybe it is a problem at the GDAL SQL Server driver site. Have you looked at msdn.microsoft.com/en-us/library/ms130822.aspx, AutoTranslate section?"

The two documented values for AutoTranslate are yes and no. It's not explict, but I think yes is the default value.

SQL Server automatically maps the code points for extended characters between different ANSI (single-byte) character sets. If you disable the translation behaviour, character data could get mangled if the client and the server are using different code pages.

The stored data type is Unicode (nvarchar) rather than ANSI (varchar) so I don't think setting would have an effect here.

I'll import the data once with translation explicitly disabled and again with it explicitly enabled to see if there is a difference.

In this example, the connection string contains autotranslate=no:

$ ogr2ogr --DEBUG ON -s_srs EPSG:4326 -lco GEOM_TYPE=geography -overwrite -f MSSQLSpatial "MSSQL:server=.;database=tempdb;trusted_connection=yes;autotranslate=no" london_points.vrt
OGR: OGROpen(london_points.vrt/012AFF40) succeeded as VRT.
OGR_MSSQLSpatial: EstablishSession(Connection:"server=.;database=tempdb;trusted_connection=yes;autotranslate=no")
ODBC: SQLDriverConnect(DRIVER=SQL Server;server=.;database=tempdb;trusted_connection=yes;autotranslate=no)
OGR_MSSQLSpatial: Using column ogr_fid as FID for table london_points.
OGR: OGROpen(MSSQL:server=.;database=tempdb;trusted_connection=yes;autotranslate=no/0335E048) succeeded as MSSQLSpatial.
OGR: OGROpen(london_points.csv/033732D8) succeeded as CSV.
MSSQLSpatial: DeleteLayer(london_points)
OGR_MSSQLSpatial: Using column ogr_fid as FID for table london_points.
OGR2OGR: 5 features written in layer 'london_points'
ODBC: SQLDisconnect()
VRT: 5 features read on layer 'london_points'.
CSV: 5 features read on layer 'london_points'.

SELECT languagecode, name, longitude, latitude
FROM tempdb.dbo.london_points;

languagecode name         longitude latitude
------------ ------------ --------- ---------
ar           لندن     -0.143555 51.577222
cz           Londýn      -0.143555 51.577222
en           London       -0.143555 51.577222
ru           Лондон -0.143555 51.577222
zw           倫敦       -0.143555 51.577222

In this example the connection string contains autotranslate=yes:

$ ogr2ogr --DEBUG ON -s_srs EPSG:4326 -lco GEOM_TYPE=geography -overwrite -f MSSQLSpatial "MSSQL:server=.;database=tempdb;trusted_connection=yes;autotranslate=yes" london_points.vrt
OGR: OGROpen(london_points.vrt/02DCFF40) succeeded as VRT.
OGR_MSSQLSpatial: EstablishSession(Connection:"server=.;database=tempdb;trusted_connection=yes;autotranslate=yes")
ODBC: SQLDriverConnect(DRIVER=SQL Server;server=.;database=tempdb;trusted_connection=yes;autotranslate=yes)
OGR_MSSQLSpatial: Using column ogr_fid as FID for table london_points.
OGR: OGROpen(MSSQL:server=.;database=tempdb;trusted_connection=yes;autotranslate=yes/0344E048) succeeded as MSSQLSpatial.
OGR: OGROpen(london_points.csv/034632D8) succeeded as CSV.
MSSQLSpatial: DeleteLayer(london_points)
OGR_MSSQLSpatial: Using column ogr_fid as FID for table london_points.
OGR2OGR: 5 features written in layer 'london_points'
ODBC: SQLDisconnect()
VRT: 5 features read on layer 'london_points'.
CSV: 5 features read on layer 'london_points'.

SELECT languagecode, name, longitude, latitude
FROM tempdb.dbo.london_points;

languagecode name         longitude latitude
------------ ------------ --------- ---------
ar           لندن     -0.143555 51.577222
cz           Londýn      -0.143555 51.577222
en           London       -0.143555 51.577222
ru           Лондон -0.143555 51.577222
zw           倫敦       -0.143555 51.577222

The results are identical.

What SQL statements does ogr2ogr send to server?

I traced the import using SQL Server Profiler to capture the SQL statements that ogr2ogr uses to insert data in to the london_points table.

The trace captured these insert statements:

INSERT INTO [dbo].[london_points] (ogr_geometry, [languagecode], [name], [longitude], [latitude])
VALUES (geography::STGeomFromText('POINT (-0.143555 51.577222)',4326), 'ar', 'لندن', '-0.143555', '51.577222');
go
INSERT INTO [dbo].[london_points] (ogr_geometry, [languagecode], [name], [longitude], [latitude])
VALUES (geography::STGeomFromText('POINT (-0.143555 51.577222)',4326), 'cz', 'Londýn', '-0.143555', '51.577222');
go
INSERT INTO [dbo].[london_points] (ogr_geometry, [languagecode], [name], [longitude], [latitude])
VALUES (geography::STGeomFromText('POINT (-0.143555 51.577222)',4326), 'en', 'London', '-0.143555', '51.577222');
go
INSERT INTO [dbo].[london_points] (ogr_geometry, [languagecode], [name], [longitude], [latitude])
VALUES (geography::STGeomFromText('POINT (-0.143555 51.577222)',4326), 'ru', 'Лондон', '-0.143555', '51.577222');
go
INSERT INTO [dbo].[london_points] (ogr_geometry, [languagecode], [name], [longitude], [latitude])
VALUES (geography::STGeomFromText('POINT (-0.143555 51.577222)',4326), 'zw', '倫敦', '-0.143555', '51.577222');
go

For clarity I split each INSERT VALUES statement across two lines.

Inspecting the SQL reveals two problems.

ogr2ogr mangles the values for the name column before they reach the server. This proves the problem lies with ogr2ogr.

ogr2ogr passes the values as varchar literals ('abc') rather than nvarchar literals (N'abc'). Even if the query text contained the correct characters, the server would replace with a question mark any character that don't exist in the default code page.

This query shows the difference:

SELECT 'لندن' AS varchar_value, N'لندن' AS nvarchar_value
UNION ALL
SELECT 'Londýn', N'Londýn'
UNION ALL
SELECT 'London', N'London'
UNION ALL
SELECT 'Лондон', N'Лондон'
UNION ALL
SELECT '倫敦', N'倫敦';

varchar_value nvarchar_value
------------- --------------
????          لندن
Londýn        Londýn
London        London
??????        Лондон
??            倫敦

My default code page is Windows-1252. It contains characters for English and Czech, so varchar string literals are fine for these languages. Arabic, Russian, and Traditional Chinese supported only in Unicode, so all the characters are replaced by question marks.

I believe now that this is a fault in the OGR MSSQLSpatial driver.

Best Answer

I have same problem, my understanding is that the stored data is correct, if you consider it a binary utf-8 encoded data, to convert this data to nvarchar i wrote this (partially tested) UDF function in sql server:

CREATE FUNCTION varbinUTF8_to_nvarchar ( @data VARBINARY(MAX) )
RETURNS NVARCHAR(MAX)
AS
BEGIN
    DECLARE @n INT
    DECLARE @buffer NVARCHAR(MAX)
    SET @buffer = N''
    SET @n = DATALENGTH(@data)

    DECLARE @i INT
    SET @i = 0

    DECLARE @byte INT
    DECLARE @byte2 INT
    DECLARE @byte3 INT
    DECLARE @byte4 INT

    DECLARE @decoded NCHAR

    WHILE @i < @n
        BEGIN
            SET @i = @i + 1
            SET @byte = SUBSTRING(@data, @i, 1)
            IF ( @byte < 0x7f )
                SET @decoded = NCHAR(@byte)
            ELSE
                IF @byte & 0xe0 = 0xc0
                    BEGIN
                        SET @i = @i + 1
                        SET @byte2 = SUBSTRING(@data, @i, 1)
                        SET @decoded = NCHAR(( ( @byte & 0x1F ) * 0x40 )
                                             + ( @byte2 & 0x3f ))       
                    END
                ELSE
                    IF @byte & 0xf0 = 0xe0
                        BEGIN
                            SET @i = @i + 1
                            SET @byte2 = SUBSTRING(@data, @i, 1)
                            SET @i = @i + 1
                            SET @byte3 = SUBSTRING(@data, @i, 1)
                            SET @decoded = NCHAR(( ( @byte & 0xF )
                                                   * 0x1000 ) + ( ( @byte2
                                                          & 0x3f ) * 0x40 )
                                                 + ( (@byte3 & 0x3f) ))
                        END 
                    ELSE
                        IF @byte & 0xf8 = 0xf0
                            BEGIN
                                SET @i = @i + 1
                                SET @byte2 = SUBSTRING(@data, @i, 1)
                                SET @i = @i + 1
                                SET @byte3 = SUBSTRING(@data, @i, 1)
                                SET @i = @i + 1
                                SET @byte4 = SUBSTRING(@data, @i, 1)
                                /*
                                EDIT 2014-11-27 THIS PART WERE WRONG
                                SET @decoded = NCHAR(( ( @byte & 0xF )
                                                       * 0x1000 )
                                                     + ( ( @byte2 & 0x3f )
                                                         * 0x40 )
                                                     + ( ( @byte3 & 0x3f )
                                                         * 0x40 )
                                                     + ( (@byte4 & 0x3f) ))
                                 FOLLOWS PATCH, but sql server does not support unicode chars > 16 bit
                                */
                                SET @decoded = NCHAR(( ( @byte & 0x7 )
                                                       * 0x40000)
                                                     + ( ( @byte2 & 0x3f )
                                                         * 0x1000 )
                                                     + ( ( @byte3 & 0x3f )
                                                         * 0x40 )
                                                     + ( (@byte4 & 0x3f) ))


                            END 
                        ELSE
                            BEGIN
                                return  N'BAD UTF-8 MESSAGE'
                            END 

            SET @buffer = @buffer + @decoded

        END 
    return  @buffer
END

Then i can convert the wrong formatted field using something like:

SELECT     
    dbo.varbinUTF8_to_nvarchar(CONVERT(VARBINARY(MAX), name_1) )
FROM
    dbo.che_adm1

In your specific case you should be able to get nvarchar data doing :

SELECT languagecode, dbo.varbinUTF8_to_nvarchar(CONVERT(VARBINARY(MAX), name) as name, longitude, latitude
FROM tempdb.dbo.london_points;

Best Regards Andrea

Related Question