[Tex/LaTex] German Umlauts in US-ASCII encoding

germaninput-encodings

I got a LaTeX project that has been exported (downloaded) from an ShareLaTeX Online installation. Seems that it stored all Tex files as US-ASCII according to what I see on MacOS using the following command in the terminal:

file -I myfile.tex

Which results in:

myfile.tex: text/x-tex; charset=us-ascii

The original error is:

myfile.tex:8: Undefined control sequence. …\numberline {\thechapter }Vortr\UTF{00E4}ge}{\thepage }} l.8 \chapter{Vortr\UTF{00E4}ge}

And the LaTeX source is (where the \ got copied and pasted here as a ¥ symbol). The original file can be download here.

\documentclass[12pt,a4paper,ngerman,onecolumn]{book}
\usepackage[ngerman,english]{babel}  
\usepackage[T1]{fontenc}
\usepackage[utf8]{inputenc}

\begin{document}

\chapter{Vorträge}

\end{document}

The question is, can I fix this somehow? I know that UTF-8 is the better choice. But the LaTeX project consists of many files that are all in the US-ASCII encoding.

Best Answer

The exported file has replaced the UTF-8 characters by macro calls \UTF{...} with the hexadecimal Unicode code point as argument. The macro could be defined in TeX, but this will not work in all circumstances (verbatim text, ...). Therefore, the best approach is to write a script/program to convert the macro calls back to the UTF-8 encoded Unicode characters.

Here a simple script sharelatex_recode.py for Python 3. It takes the file as argument and updates the file if necessary:

#!/usr/bin/env python

import argparse
import re
import sys

if sys.version_info[0:2] < (3, 2):  # tested with 3.6
    print('Python >= 3.2 is required.')
    sys.exit(1)


def main():
    args = parse_command_line()
    convert(args.input_file, args.dry_run)


def parse_command_line():
    parser = argparse.ArgumentParser(
        description=r'Replace TeX macro \UTF{...} calls to UTF-8 characters.',
    )
    parser.add_argument(
        'input_file',
        help='input TeX file',
    )
    parser.add_argument(
        '--dry-run',
        action='store_true',
        help='the file is not updated and written',
    )
    return parser.parse_args()


def convert(file_name, dry_run):
    with open(file_name, 'rb') as handle:
        data = handle.read()

    new_data, replacements = re.subn(
        br'\\UTF\{([0-9A-Fa-f]{4})\}',
        repl,
        data,
    )

    if replacements:
        print('=> Replacements: {}'.format(replacements))
        if not dry_run:
            with open(file_name, 'wb') as handle:
                handle.write(new_data)

            print('=> File written: {}'.format(file_name))
    else:
        print('=> Already uptodate: {}'.format(file_name))


def repl(match):
    code = int(match.group(1), 16)
    char = chr(code)
    utf8_sequence = char.encode('utf8')
    return utf8_sequence


if __name__ == '__main__':
    main()

The problematic line

\chapter{Vortr\UTF{00E4}ge}

becomes

\chapter{Vorträge}

It is also possible to convert all .tex files recursively, example for bash:

$ find start_directory -name \*.tex -exec python3 sharelatex_recode.py {} \;

Unicode characters outside the BMP (Basic Multilingual Plane) are not supported by the script, because I do not know, what the export of ShareLaTeX to US ASCII does in this case.