[Tex/LaTex] German Umlauts in US-ASCII encoding

german input-encodings

I have a LaTeX project that was exported (downloaded) from a ShareLaTeX Online installation. It seems that all .tex files were stored as US-ASCII, judging by what I see on macOS when running the following command in the terminal:

file -I myfile.tex

Which results in:

myfile.tex: text/x-tex; charset=us-ascii
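
The same check can be made without the file utility. A minimal sketch in Python, assuming Python 3.7 or newer (for bytes.isascii); the helper name is_pure_ascii is made up for illustration:

```python
from pathlib import Path


def is_pure_ascii(path):
    """Return True if every byte in the file is in the 7-bit ASCII range."""
    return Path(path).read_bytes().isascii()


if __name__ == '__main__':
    import sys

    # Usage: python3 check_ascii.py myfile.tex [more files ...]
    for name in sys.argv[1:]:
        print('{}: {}'.format(name, 'us-ascii' if is_pure_ascii(name) else 'non-ascii'))
```

A file that contains \UTF{00E4} instead of ä is reported as us-ascii, which matches the output above.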

The original error is:

myfile.tex:8: Undefined control sequence. …\numberline {\thechapter }Vortr\UTF{00E4}ge}{\thepage }} l.8 \chapter{Vortr\UTF{00E4}ge}

The LaTeX source is as follows (the \ may show up as a ¥ symbol when copied and pasted). The original file can be downloaded here.

\documentclass[12pt,a4paper,ngerman,onecolumn]{book}
\usepackage[ngerman,english]{babel}  
\usepackage[T1]{fontenc}
\usepackage[utf8]{inputenc}

\begin{document}

\chapter{Vorträge}

\end{document}

The question is, can I fix this somehow? I know that UTF-8 is the better choice. But the LaTeX project consists of many files that are all in the US-ASCII encoding.

Best Answer

The exported file has replaced the UTF-8 characters by macro calls \UTF{...} with the hexadecimal Unicode code point as argument. The macro could be defined in TeX, but this will not work in all circumstances (verbatim text, ...). Therefore, the best approach is to write a script/program to convert the macro calls back to the UTF-8 encoded Unicode characters.

Here is a simple script, sharelatex_recode.py, for Python 3. It takes the file name as its argument and rewrites the file in place if any replacements were made:

#!/usr/bin/env python3

import argparse
import re
import sys

if sys.version_info[0:2] < (3, 2):  # tested with 3.6
    print('Python >= 3.2 is required.')
    sys.exit(1)


def main():
    args = parse_command_line()
    convert(args.input_file, args.dry_run)


def parse_command_line():
    parser = argparse.ArgumentParser(
        description=r'Replace TeX macro \UTF{...} calls with UTF-8 characters.',
    )
    parser.add_argument(
        'input_file',
        help='input TeX file',
    )
    parser.add_argument(
        '--dry-run',
        action='store_true',
        help='report the number of replacements without writing the file',
    )
    return parser.parse_args()


def convert(file_name, dry_run):
    with open(file_name, 'rb') as handle:
        data = handle.read()

    new_data, replacements = re.subn(
        br'\\UTF\{([0-9A-Fa-f]{4})\}',
        repl,
        data,
    )

    if replacements:
        print('=> Replacements: {}'.format(replacements))
        if not dry_run:
            with open(file_name, 'wb') as handle:
                handle.write(new_data)

            print('=> File written: {}'.format(file_name))
    else:
        print('=> Already up to date: {}'.format(file_name))


def repl(match):
    code = int(match.group(1), 16)
    char = chr(code)
    utf8_sequence = char.encode('utf8')
    return utf8_sequence


if __name__ == '__main__':
    main()

The problematic line

\chapter{Vortr\UTF{00E4}ge}

becomes

\chapter{Vorträge}
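
The substitution can be tried on this single line in memory, without touching any file. A small sketch that reuses the pattern from the script; the names UTF_MACRO and decode_macro are chosen here for illustration only:

```python
import re

# Same pattern as in the script: \UTF{XXXX} with exactly four hex digits.
UTF_MACRO = re.compile(rb'\\UTF\{([0-9A-Fa-f]{4})\}')


def decode_macro(match):
    """Turn one \\UTF{XXXX} match into the UTF-8 bytes of that code point."""
    return chr(int(match.group(1), 16)).encode('utf8')


line = rb'\chapter{Vortr\UTF{00E4}ge}'
converted = UTF_MACRO.sub(decode_macro, line)
print(converted.decode('utf8'))  # prints \chapter{Vorträge}
```

The data is handled as bytes throughout, so everything outside the macro calls is passed through unchanged.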

It is also possible to convert all .tex files recursively. An example for bash:

$ find start_directory -name \*.tex -exec python3 sharelatex_recode.py {} \;
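
Before converting a whole tree, it may help to see which files are affected at all. A rough sketch using pathlib, which only reads the files; the function name count_utf_macros is hypothetical and not part of the script above:

```python
import re
from pathlib import Path

# Matches one \UTF{XXXX} call with four hex digits, as in the script.
UTF_MACRO = re.compile(rb'\\UTF\{[0-9A-Fa-f]{4}\}')


def count_utf_macros(start_directory):
    """Map each .tex file under start_directory to its number of \\UTF{XXXX} calls."""
    counts = {}
    for path in sorted(Path(start_directory).rglob('*.tex')):
        hits = len(UTF_MACRO.findall(path.read_bytes()))
        if hits:
            # Only files that actually contain macro calls are reported.
            counts[str(path)] = hits
    return counts


if __name__ == '__main__':
    for name, hits in count_utf_macros('start_directory').items():
        print('{}: {}'.format(name, hits))
```

Files without any \UTF{...} calls are left out of the result, so an empty result means there is nothing left to convert.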

The script does not support Unicode characters outside the BMP (Basic Multilingual Plane), because I do not know what ShareLaTeX's export to US-ASCII does in that case.
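
One thing that can be said for certain: if the export ever wrote such characters as UTF-16 surrogate halves (for example \UTF{D83D}\UTF{DE00}), the script would stop with a UnicodeEncodeError, because a lone surrogate cannot be encoded as UTF-8. Whether ShareLaTeX does this is unknown, so the following is only a defensive sketch that leaves such calls untouched; the name repl_skip_surrogates is made up here:

```python
import re

UTF_MACRO = re.compile(rb'\\UTF\{([0-9A-Fa-f]{4})\}')


def repl_skip_surrogates(match):
    """Convert \\UTF{XXXX} to UTF-8, but keep surrogate values as they are."""
    code = int(match.group(1), 16)
    if 0xD800 <= code <= 0xDFFF:
        # A surrogate half is not a character on its own; return the
        # macro call unchanged so it can be inspected by hand.
        return match.group(0)
    return chr(code).encode('utf8')


sample = rb'\UTF{00E4}\UTF{D83D}\UTF{DE00}'
result = UTF_MACRO.sub(repl_skip_surrogates, sample)
```

Any \UTF{...} call that is still present after the conversion then points to a case that needs a closer look.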

Related Solutions

[Tex/LaTex] utf8 or latin1 encoding – German

If you can, use neither \usepackage[utf8]{inputenc} nor \usepackage[latin1]{inputenc}. Use LuaTeX instead:

\documentclass{article}
\usepackage{luaotfload}
\usepackage[EU2]{fontenc}
\usepackage{lmodern}
\begin{document}
Das Mädchen ging über die \textbf{Brücke} nach \textit{draußen}.
\end{document}

This will give you access to all modern things (OpenType fonts for example) while keeping most of the backward compatibility.

Wait for TeX Live 2010 (or get the pretest) and you will have a decent environment for LuaTeX. A million thanks to the few people who make the LuaLaTeX packages!

If you are able to read German: see the site http://www.luatex.org for more examples (especially on fontspec).

[Tex/LaTex] German Umlauts in BibTeX

Instead of actually entering the ö and é characters in your source, type \"{o} and \'{e}.

Or rather: Yrjö Engeström as {Yrj{\"o} Engestr{\"o}m}.
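
For a .bib file with many names, the replacement can be done mechanically. A minimal sketch in Python that covers only a handful of characters; the table BIBTEX_ESCAPES and the function bibtex_escape are illustrative names, and the table would need to be extended for other accents:

```python
# Accented character -> braced BibTeX escape (deliberately incomplete).
BIBTEX_ESCAPES = {
    'ä': r'{\"a}', 'ö': r'{\"o}', 'ü': r'{\"u}',
    'Ä': r'{\"A}', 'Ö': r'{\"O}', 'Ü': r'{\"U}',
    'ß': r'{\ss}', 'é': r"{\'e}",
}

# str.translate expects code points as keys.
_TABLE = {ord(char): escape for char, escape in BIBTEX_ESCAPES.items()}


def bibtex_escape(text):
    """Replace the accented characters listed above with BibTeX escapes."""
    return text.translate(_TABLE)


print('{' + bibtex_escape('Yrjö Engeström') + '}')
```

Characters that are not in the table are passed through unchanged.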
