Expandable macro that extracts the first character of a UTF-8/Cyrillic string without additional packages

characters, cyrillic, strings, unicode

I would like to have an expandable macro that extracts the first (and sometimes the second) character of UTF-8/Cyrillic text strings without using additional packages. No simple solutions from TeX or LaTeX work with UTF-8/Cyrillic strings.

Below is an example of a working macro, partially taken from "Get the first and second character of a macro argument":

\documentclass{article}

\usepackage[T2A]{fontenc}
\usepackage[utf8]{inputenc}
\usepackage[russian]{babel}

\makeatletter
\newcommand{\firstof}[1]{\@car#1\@nil}
\makeatother

\begin{document}

\firstof{Vladimir}

\end{document}

Unfortunately, this example fails with the error "Invalid UTF-8 byte sequence (Ð\par)" when it is given a Cyrillic string such as \firstof{Владимир}.

I roughly understand that by default TeX is not adapted to manipulating strings with multibyte characters, but this problem is solved in some packages. However, I do not want to use other packages for such a simple problem (as it seems at first glance) and I will be grateful to the community for help and tips.

Ideally, I would like to have an expandable macro like \newcommand{\firstof}[2][1]{.....}, which by default returns the first character of a UTF-8/Cyrillic string. For example, \firstof{Владимир} should return В and \firstof[2]{Владимир} should return Вл. It should be possible to compare these characters with others using \ifx and to write them to a file using \write.

Best Answer

In pdflatex, each byte of a UTF-8 sequence is a separate token; however, you can recognise the leading token, which tells you how many bytes the character needs. This version covers the one-byte and two-byte cases.


\documentclass{article}

\usepackage[T2A]{fontenc}
\usepackage[utf8]{inputenc}
\usepackage[russian]{babel}

\makeatletter
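% Expand the first token of the argument once, then inspect the result.
\newcommand{\firstof}[1]{\expandafter\checkfirst#1\@nil}
\def\checkfirst#1{%
  \ifx\UTFviii@two@octets#1%
    % Lead byte of a two-byte sequence: keep both bytes.
    \expandafter\gettwooctets
  \else
    % Anything else is treated as a single-token (ASCII) character.
    \expandafter\@car\expandafter#1%
  \fi
}
% #1#2 are the two bytes; #3 is the remainder, which is discarded.
\def\gettwooctets#1#2#3\@nil{\UTFviii@two@octets#1#2}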

\makeatother

\begin{document}

\firstof{Vladimir}

\firstof{Владимир}

\end{document}
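
The same test can probably be extended to three-byte characters. What follows is only a minimal sketch, not part of the tested answer above: it assumes that the lead byte of a three-byte sequence expands to \UTFviii@three@octets in the same way that a two-byte lead expands to \UTFviii@two@octets. \getthreeoctets is a hypothetical helper name introduced here; \@firstoftwo and \@secondoftwo are LaTeX kernel macros, used to choose a branch without nesting \expandafter chains.

\makeatletter
\newcommand{\firstof}[1]{\expandafter\checkfirst#1\@nil}
\def\checkfirst#1{%
  \ifx\UTFviii@two@octets#1%
    \expandafter\@firstoftwo
  \else
    \expandafter\@secondoftwo
  \fi
  {\gettwooctets}% two-byte lead
  {\ifx\UTFviii@three@octets#1% assumption: three-byte lead expands to this
     \expandafter\@firstoftwo
   \else
     \expandafter\@secondoftwo
   \fi
   {\getthreeoctets}% three-byte lead
   {\@car#1}}% otherwise: single-token (ASCII) character
}
\def\gettwooctets#1#2#3\@nil{\UTFviii@two@octets#1#2}
% Hypothetical helper: keep the three bytes #1#2#3, discard the remainder #4.
\def\getthreeoctets#1#2#3#4\@nil{\UTFviii@three@octets#1#2#3}
\makeatother

With these definitions in place of the ones above, a call such as \firstof{№5} should return the three-byte character №, provided the font encoding supplies that glyph.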

If you want to process the rest of the input rather than discard everything after the first letter, a small change lets you pass in a command to apply to the remaining text. If you pass in \gobble, it extracts the first letter as before. If you pass in \firstofx\gobble, it also extracts the first letter of the remaining text, so you get two letters:


\documentclass{article}

\usepackage[T2A]{fontenc}
\usepackage[utf8]{inputenc}
\usepackage[russian]{babel}

\makeatletter
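% #1: command applied to the remaining text; #2: the string to split.
\newcommand{\firstofx}[2]{\expandafter\checkfirst#2\@nil{#1}}
\def\checkfirst#1{%
  \ifx\UTFviii@two@octets#1%
    \expandafter\gettwooctetsx
  \else
    \expandafter\getasciix\expandafter#1%
  \fi
}

% ASCII: output #1, then apply #3 to the remainder #2.
\def\getasciix#1#2\@nil#3{#1#3{#2}}

% Two-byte character: output bytes #1#2, then apply #4 to the remainder #3.
\def\gettwooctetsx#1#2#3\@nil#4{\UTFviii@two@octets#1#2#4{#3}}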

\newcommand\gobble[1]{}

\makeatother

\begin{document}

\firstofx\gobble{Vladimir}

\firstofx{\firstofx\gobble}{Vladimir}

\firstofx\gobble{Владимир}

\firstofx{\firstofx\gobble}{Владимир}


\end{document}
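
A command defined with \newcommand and an optional argument is not expandable, so the \firstof[2]{...} interface from the question cannot be provided this way. One possible workaround, shown only as a sketch, is a pair of fixed-name wrappers around \firstofx. The names \firstone and \firsttwo are hypothetical and are not part of the code above; the two lines go in the preamble after the definitions of \firstofx and \gobble.

% Hypothetical wrappers: first character only, and first two characters.
\newcommand\firstone[1]{\firstofx\gobble{#1}}
\newcommand\firsttwo[1]{\firstofx{\firstofx\gobble}{#1}}

With these, \firstone{Владимир} is equivalent to \firstofx\gobble{Владимир}, and \firsttwo{Владимир} is equivalent to \firstofx{\firstofx\gobble}{Владимир}.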