Expandable macro that extracts the first character of a UTF-8/Cyrillic string without additional packages

characters, cyrillic, strings, unicode

I would like to have an expandable macro that extracts the first (and sometimes the second) character of UTF-8/Cyrillic text strings without using additional packages. No simple solutions from TeX or LaTeX work with UTF-8/Cyrillic strings.

Below is an example of a working macro, partially taken from "Get the first and second character of a macro argument":

\documentclass{article}

\usepackage[T2A]{fontenc}
\usepackage[utf8]{inputenc}
\usepackage[russian]{babel}

\makeatletter
\newcommand{\firstof}[1]{\@car#1\@nil}
\makeatother

\begin{document}

\firstof{Vladimir}

\end{document}

Unfortunately, this example fails with the error "Invalid UTF-8 byte sequence (Ð\par)" for Cyrillic strings such as \firstof{Владимир}.

I roughly understand that by default TeX is not adapted to manipulating strings with multibyte characters, but this problem is solved in some packages. However, I do not want to use other packages for such a simple problem (as it seems at first glance) and I will be grateful to the community for help and tips.

Ideally, I would like to have an expandable macro like \newcommand{\firstof}[2][1]{.....} which, for UTF-8/Cyrillic strings, returns the first character by default. For example, \firstof{Владимир} should return В and \firstof[2]{Владимир} should return Вл. The resulting characters should be usable in \ifx comparisons and writable to a file with \write.

Best Answer

In pdflatex each byte of a UTF-8 sequence is a separate token. However, you can recognise the leading token, which tells you how many bytes the character needs. This version covers the one-byte and two-byte cases.

\documentclass{article}

\usepackage[T2A]{fontenc}
\usepackage[utf8]{inputenc}
\usepackage[russian]{babel}

\makeatletter
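% Expand the first token of #1 once, so that a two-byte lead byte
% exposes \UTFviii@two@octets as the token seen by \checkfirst.
\newcommand{\firstof}[1]{\expandafter\checkfirst#1\@nil}
\def\checkfirst#1{%
  \ifx\UTFviii@two@octets#1% two-byte character: collect both bytes
  \expandafter\gettwooctets
  \else % anything else: keep just the first token
  \expandafter\@car\expandafter#1%
  \fi
}
% #1#2 are the two bytes; #3 (the rest of the string) is discarded.
\def\gettwooctets#1#2#3\@nil{\UTFviii@two@octets#1#2}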

\makeatother

\begin{document}

\firstof{Vladimir}

\firstof{Владимир}

\end{document}
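
The same test can be chained for longer sequences. The following is a minimal sketch, not part of the answer above, that adds the three-byte case. It assumes that the utf8 support of inputenc also provides an internal \UTFviii@three@octets analogous to \UTFviii@two@octets. The helper names \checkthree and \getthreeoctets are made up for this illustration, and № is used only as a sample three-byte character that the T2A encoding is assumed to provide. Since it relies on internal macros, it may need adjusting for a given LaTeX release.

```latex
\documentclass{article}

\usepackage[T2A]{fontenc}
\usepackage[utf8]{inputenc}
\usepackage[russian]{babel}

\makeatletter
\newcommand{\firstof}[1]{\expandafter\checkfirst#1\@nil}
% Two-byte lead token? If not, hand the same token on to \checkthree.
\def\checkfirst#1{%
  \ifx\UTFviii@two@octets#1%
    \expandafter\@firstoftwo
  \else
    \expandafter\@secondoftwo
  \fi
  {\gettwooctets}%
  {\checkthree#1}%
}
% Three-byte lead token (assumed internal name)? Otherwise fall back to \@car.
\def\checkthree#1{%
  \ifx\UTFviii@three@octets#1%
    \expandafter\getthreeoctets
  \else
    \expandafter\@car\expandafter#1%
  \fi
}
% Keep the bytes of the first character, drop the rest up to \@nil.
\def\gettwooctets#1#2#3\@nil{\UTFviii@two@octets#1#2}
\def\getthreeoctets#1#2#3#4\@nil{\UTFviii@three@octets#1#2#3}
\makeatother

\begin{document}

\firstof{Vladimir}

\firstof{Владимир}

\firstof{№5}

\end{document}
```

A four-byte branch could presumably be added the same way with one more test and one more collector macro.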

If you want to handle the rest of the input, as opposed to discarding everything after the first letter, you can make a small change so that you pass in a command to apply to the remaining text. If you pass in \gobble, it extracts the first letter as before. If you pass in \firstofx\gobble, it also extracts the first letter of the remaining text, so you get two letters:

\documentclass{article}

\usepackage[T2A]{fontenc}
\usepackage[utf8]{inputenc}
\usepackage[russian]{babel}

\makeatletter
\newcommand{\firstofx}[2]{\expandafter\checkfirst#2\@nil{#1}}
\def\checkfirst#1{%
  \ifx\UTFviii@two@octets#1%
  \expandafter\gettwooctetsx
  \else
  \expandafter\getasciix\expandafter#1%
  \fi
}

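% One-token case: output the first token #1, then apply #3 to the rest #2.
\def\getasciix#1#2\@nil#3{#1#3{#2}}

% Two-byte case: output the character (#1#2), then apply #4 to the rest #3.
\def\gettwooctetsx#1#2#3\@nil#4{\UTFviii@two@octets#1#2#4{#3}}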

\newcommand\gobble[1]{}

\makeatother

\begin{document}

\firstofx\gobble{Vladimir}

\firstofx{\firstofx\gobble}{Vladimir}

\firstofx\gobble{Владимир}

\firstofx{\firstofx\gobble}{Владимир}


\end{document}
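
If the nested call is needed often, it can be wrapped in a small helper. This is only a sketch: the name \firsttwo is invented here, it reuses the definitions from the listing above unchanged, and it assumes the string contains at least two characters (an empty remainder is not handled).

```latex
\documentclass{article}

\usepackage[T2A]{fontenc}
\usepackage[utf8]{inputenc}
\usepackage[russian]{babel}

\makeatletter
% Definitions copied from the listing above.
\newcommand{\firstofx}[2]{\expandafter\checkfirst#2\@nil{#1}}
\def\checkfirst#1{%
  \ifx\UTFviii@two@octets#1%
  \expandafter\gettwooctetsx
  \else
  \expandafter\getasciix\expandafter#1%
  \fi
}
\def\getasciix#1#2\@nil#3{#1#3{#2}}
\def\gettwooctetsx#1#2#3\@nil#4{\UTFviii@two@octets#1#2#4{#3}}
\newcommand\gobble[1]{}

% Hypothetical convenience wrapper: first two characters of #1.
% No optional argument is used, so the macro stays expandable.
\newcommand{\firsttwo}[1]{\firstofx{\firstofx\gobble}{#1}}
\makeatother

\begin{document}

\firsttwo{Vladimir}

\firsttwo{Владимир}

\end{document}
```

Nesting one level deeper, as in \firstofx{\firstofx{\firstofx\gobble}}{...}, should give the first three characters in the same way.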

Related Solutions

Is there a way to use string manipulations to make macro names

This is essentially the same as @gernot's answer but reduces the number of \expandafters and \csnames making the code more readable, imho.

The approach uses two steps of processing, the first step just reads in the first argument (the others are curried) and creates the macro names from it, resulting in two new arguments. The next step then gets all the arguments.

\documentclass[]{article}

\usepackage{tikz}

\makeatletter
\newcommand\newABC[1]
  {%
    % #1: name for macro and box
    % #2: before   (curried)
    % #3: raise    (curried)
    % #4: after    (curried)
    % #5: contents (curried)
    \expandafter\newABC@\csname #1\expandafter\endcsname\csname #1box\endcsname
  }
\newcommand\newABC@[6]
  {%
    % #1: macro
    % #2: box-macro
    % #3: before
    % #4: raise
    % #5: after
    % #6: contents
    \newsavebox#2%
    \sbox#2{\mathalpha{\hspace{#3pt}\raisebox{#4pt}{#6}\hspace{#5pt}}}%
    \newcommand#1{\usebox#2}%
  }
\makeatother

\newABC{ehh}{1}{1}{.5}
{%
\begin{tikzpicture}
\node at (0,0){\(h\)};
\end{tikzpicture}%
}

\begin{document}
A\ehh b
\end{document}

As requested a few explanations on the code:

\csname expands all following tokens until it finds an \endcsname and the result is turned into a control sequence.
\expandafter steps over the next token (regardless which kind of token, an opening brace for instance is a token as well which could be stepped over with this) and expands the token after that one once (if that token isn't expandable nothing happens).

So \expandafter\stuff\csname foo\endcsname will result in \stuff being stepped over and \csname being expanded once. Within a single step of expansion \csname expands all following tokens until it finds an \endcsname and leaves everything in between as the name of a control sequence. In this case it'll find foo (letters don't expand further), and so after \csname is done the \expandafter will be removed from the input stream and \stuff put back, so the input will now contain \stuff\foo.
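
As a minimal illustration of this single-name case (the macro name greeting is invented for the example), \newcommand plays the role of \stuff:

```latex
\documentclass{article}
\begin{document}
% \expandafter steps over \newcommand and expands \csname once,
% so TeX then sees: \newcommand\greeting{Hello}
\expandafter\newcommand\csname greeting\endcsname{Hello}
\greeting
\end{document}
```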

We can utilize the fact that \csname expands everything until it finds an \endcsname to build two control sequences at once (in the following the next thing TeX will evaluate will be preceded by |> and the tokens stored to be put back because TeX stepped over them will be preceded by || -- this is the same style the unravel package would use, though my steps might not be the same that package would show):

|> \expandafter\stuff\csname foo\expandafter\endcsname\csname bar\endcsname

will first step over \stuff, so the input will look like this (this is less than one step of expansion, more a step of processing):

|| \expandafter\stuff
|> \csname foo\expandafter\endcsname\csname bar\endcsname

Now \csname will start grabbing/expanding tokens, and because of \expandafter the \endcsname will not be found, instead TeX steps over it and expands what follows:

|| \expandafter\stuff
|| \csname foo\expandafter\endcsname
|> \csname bar\endcsname

Now the second \csname grabs/expands tokens until it finds \endcsname and turns the found string into a control sequence:

|| \expandafter\stuff
|| \csname foo\expandafter\endcsname
|| \csname bar
|> \endcsname

and

|| \expandafter\stuff
|| \csname foo\expandafter\endcsname
|| \bar

Now the second \csname is done with one step of expansion and the second \expandafter will be removed and the token which followed it put back, so the next step of processing would look like

|| \expandafter\stuff
|| \csname foo
|> \endcsname\bar

The first \csname finally finds its \endcsname and this will become

|| \expandafter\stuff
|| \foo\bar

Now also the first \csname had its step of expansion from \expandafter, so it'll be removed and \stuff put back, so this eventually becomes

|> \stuff\foo\bar

and now \stuff can do stuff.

Even though the above was visualized as many small processing steps, in terms of expansion it all happens in a single step: \expandafter expands the first \csname once, and that one expansion fully expands everything up to the matching \endcsname.
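
The walkthrough can be tried out with a small test file. In this sketch \shownames is a hypothetical stand-in for \stuff that prints the names of the two control sequences it receives. T1 encoding is loaded only so that the backslash produced by \string prints as expected.

```latex
\documentclass{article}
\usepackage[T1]{fontenc}
% Hypothetical stand-in for \stuff: prints the names of its two arguments.
\newcommand\shownames[2]{\texttt{\string#1} and \texttt{\string#2}}
\begin{document}
% Both names are built before \shownames reads its arguments.
\expandafter\shownames\csname foo\expandafter\endcsname\csname bar\endcsname
\end{document}
```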

Capitalize first letter of non-latin word (utf-8 two bits)

Use \MakeTitlecase (this requires a current LaTeX, the underlying expl3 command \text_titlecase:n can be used also in older systems).

\documentclass{article}
\usepackage[T2A]{fontenc}

\begin{document}

\MakeUppercase{з}дравей 
   
\MakeTitlecase{здравей}

\end{document}
