[Tex/LaTex] Using XeTeX for automatic transliteration of cyrillic letters

cyrillicfont-encodingsluatexxetex

This is a follow-up question to the following question: Serbian Cyrillic using LuaTeX and XeTeX.

I actually often need character substitution the other way around, that is, I write cyrillic, but want transliterated output, e.g. I type добрый but get dobryj in the result document. This is very handy, and I use the following mappings with pdflatex to achieve this (I'm including this so people can reuse it):

\DeclareUnicodeCharacter{1040}{A}
\DeclareUnicodeCharacter{1041}{B}
\DeclareUnicodeCharacter{1042}{V}
\DeclareUnicodeCharacter{1043}{G}
\DeclareUnicodeCharacter{1044}{D}
\DeclareUnicodeCharacter{1045}{E}
\DeclareUnicodeCharacter{1046}{Ž}
\DeclareUnicodeCharacter{1047}{Z}
\DeclareUnicodeCharacter{1049}{J}
\DeclareUnicodeCharacter{1050}{K}
\DeclareUnicodeCharacter{1051}{L}
\DeclareUnicodeCharacter{1052}{M}
\DeclareUnicodeCharacter{1053}{N}
\DeclareUnicodeCharacter{1054}{O}
\DeclareUnicodeCharacter{1055}{P}
\DeclareUnicodeCharacter{1056}{R}
\DeclareUnicodeCharacter{1057}{S}
\DeclareUnicodeCharacter{1058}{T}
\DeclareUnicodeCharacter{1059}{U}
\DeclareUnicodeCharacter{1060}{F}
\DeclareUnicodeCharacter{1062}{C}
\DeclareUnicodeCharacter{1063}{Č}
\DeclareUnicodeCharacter{1064}{Š}
\DeclareUnicodeCharacter{1069}{Ė}
\DeclareUnicodeCharacter{1070}{Ju}
\DeclareUnicodeCharacter{1071}{Ja}
\DeclareUnicodeCharacter{1025}{Ë}
\DeclareUnicodeCharacter{1072}{a}
\DeclareUnicodeCharacter{1073}{b}
\DeclareUnicodeCharacter{1074}{v}
\DeclareUnicodeCharacter{1075}{g}
\DeclareUnicodeCharacter{1076}{d}
\DeclareUnicodeCharacter{1077}{e}
\DeclareUnicodeCharacter{1078}{ž}
\DeclareUnicodeCharacter{1079}{z}
\DeclareUnicodeCharacter{1080}{i}
\DeclareUnicodeCharacter{1081}{j}
\DeclareUnicodeCharacter{1082}{k}
\DeclareUnicodeCharacter{1083}{l}
\DeclareUnicodeCharacter{1084}{m}
\DeclareUnicodeCharacter{1085}{n}
\DeclareUnicodeCharacter{1086}{o}
\DeclareUnicodeCharacter{1087}{p}
\DeclareUnicodeCharacter{1088}{r}
\DeclareUnicodeCharacter{1089}{s}
\DeclareUnicodeCharacter{1090}{t}
\DeclareUnicodeCharacter{1091}{u}
\DeclareUnicodeCharacter{1092}{f}
\DeclareUnicodeCharacter{1094}{c}
\DeclareUnicodeCharacter{1095}{č}
\DeclareUnicodeCharacter{1096}{š}
\DeclareUnicodeCharacter{1101}{ė}
\DeclareUnicodeCharacter{1102}{ju}
\DeclareUnicodeCharacter{1103}{ja}
\DeclareUnicodeCharacter{1105}{ë}
\DeclareUnicodeCharacter{1110}{i}
\DeclareUnicodeCharacter{1030}{I}
\DeclareUnicodeCharacter{1108}{je}
\DeclareUnicodeCharacter{1028}{Je}
\DeclareUnicodeCharacter{1061}{X}
\DeclareUnicodeCharacter{1093}{x}
\DeclareUnicodeCharacter{1048}{I}
\DeclareUnicodeCharacter{1065}{ŠČ}
\DeclareUnicodeCharacter{1066}{'}
\DeclareUnicodeCharacter{1067}{Y}
\DeclareUnicodeCharacter{1068}{'}
\DeclareUnicodeCharacter{1097}{šč}
\DeclareUnicodeCharacter{1098}{'}
\DeclareUnicodeCharacter{1099}{y}
\DeclareUnicodeCharacter{1100}{'}

My question is: is there a straight-forward way of reusing this very mapping in XeTex?
I assume: no, I need to input all the UTF-8 codes, right? But maybe somebody else has already done that. Is there any repository of mapping files?

Best Answer

The method is similar to that one used for Serbian. Prepare the following cyrillic-to-latin.map file:

; TECkit mapping for TeX input conventions <-> Unicode characters

LHSName "Cyrillic-to-Latin"
RHSName "UNICODE"

pass(Unicode)

; ligatures from Knuth's original CMR fonts
U+002D U+002D           <>  U+2013  ; -- -> en dash
U+002D U+002D U+002D    <>  U+2014  ; --- -> em dash

U+0027          <>  U+2019  ; ' -> right single quote
U+0027 U+0027   <>  U+201D  ; '' -> right double quote
U+0022           >  U+201D  ; " -> right double quote

U+0060          <>  U+2018  ; ` -> left single quote
U+0060 U+0060   <>  U+201C  ; `` -> left double quote

U+0021 U+0060   <>  U+00A1  ; !` -> inverted exclam
U+003F U+0060   <>  U+00BF  ; ?` -> inverted question

; additions supported in T1 encoding
U+002C U+002C   <>  U+201E  ; ,, -> DOUBLE LOW-9 QUOTATION MARK
U+003C U+003C   <>  U+00AB  ; << -> LEFT POINTING GUILLEMET
U+003E U+003E   <>  U+00BB  ; >> -> RIGHT POINTING GUILLEMET


U+0410 <> U+0041  ; A
U+0411 <> U+0042  ; B
U+0412 <> U+0056  ; V
U+0413 <> U+0047  ; G
U+0414 <> U+0044  ; D
U+0415 <> U+0045  ; E
U+0416 <> U+017D  ; Ž
U+0417 <> U+005A  ; Z
U+0418 <> U+004A  ; J
U+041A <> U+004B  ; K
U+041B <> U+004C  ; L
U+041C <> U+004D  ; M
U+041D <> U+004E  ; N
U+041E <> U+004F  ; O
U+041F <> U+0050  ; P
U+0420 <> U+0052  ; R
U+0421 <> U+0053  ; S
U+0422 <> U+0054  ; T
U+0423 <> U+0055  ; U
U+0424 <> U+0046  ; F
U+0426 <> U+0043  ; C
U+0427 <> U+010C  ; Č
U+0428 <> U+0160  ; Š
U+042D <> U+0116  ; Ė
U+042E <> U+004A U+0075  ; Ju
U+042F <> U+004A U+0061  ; Ja
U+0401 <> U+00CB  ; Ë
U+0430 <> U+0061  ; a
U+0431 <> U+0062  ; b
U+0432 <> U+0076  ; v
U+0433 <> U+0067  ; g
U+0434 <> U+0064  ; d
U+0435 <> U+0065  ; e
U+0436 <> U+017E  ; ž
U+0437 <> U+007A  ; z
U+0438 <> U+0069  ; i
U+0439 <> U+006A  ; j
U+043A <> U+006B  ; k
U+043B <> U+006C  ; l
U+043C <> U+006D  ; m
U+043D <> U+006E  ; n
U+043E <> U+006F  ; o
U+043F <> U+0070  ; p
U+0440 <> U+0072  ; r
U+0441 <> U+0073  ; s
U+0442 <> U+0074  ; t
U+0443 <> U+0075  ; u
U+0444 <> U+0066  ; f
U+0446 <> U+0063  ; c
U+0447 <> U+010D  ; č
U+0448 <> U+0161  ; š
U+044D <> U+0117  ; ė
U+044E <> U+006A U+0075  ; ju
U+044F <> U+006A U+0061  ; ja
U+0451 <> U+00EB  ; ë
U+0456 <> U+0069  ; i
U+0406 <> U+0049  ; I
U+0454 <> U+006A U+0065  ; je
U+0468 <> U+004A U+0065  ; Je
U+0425 <> U+0058  ; X
U+0445 <> U+0078  ; x
U+0418 <> U+0049  ; I
U+0429 <> U+0160  U+010C ; ŠČ
U+042A <> U+0027  ; '
U+042B <> U+0059  ; Y
U+042C <> U+2019  ; '
U+0449 <> U+0161  U+010D ; šč
U+044A <> U+2019  ; '
U+044B <> U+0079  ; y
U+044C <> U+2019  ; '

and run it through teckit_compile to produce the file cyrillic-to-latin.tec file that should be put in a place where XeTeX can find it. Then a document such as the following

\documentclass{article}
\usepackage{fontspec}
\setmainfont[Ligatures=TeX]{Linux Libertine O}
\usepackage{polyglossia}
\setmainlanguage{english}
\setotherlanguage{russian}
\newfontfamily{\transrussian}[Mapping=cyrillic-to-latin]{Linux Libertine O}

\newenvironment{translitterated}
  {\transrussian\hyphenrules{nohyphenation}\ignorespaces}
  {\ignorespacesafterend}

\begin{document}

\begin{russian}
Москва — столица Российской Федерации, город федерального значения,
административный центр Центрального федерального округа и центр
Московской области, в состав которой не входит. Крупнейший по
численности населения город России и Европы (население на 1 января
2012 года — 11 629 116 человек), по этому показателю входит в
десятку крупнейших городов мира. Центр Московской городской
агломерации.
\end{russian}

\begin{translitterated}
Москва — столица Российской Федерации, город федерального значения,
административный центр Центрального федерального округа и центр
Московской области, в состав которой не входит. Крупнейший по
численности населения город России и Европы (население на 1 января
2012 года — 11 629 116 человек), по этому показателю входит в
десятку крупнейших городов мира. Центр Московской городской
агломерации.
\end{translitterated}

\end{document}

will give a result similar to the following


enter image description here


The nohyphenation in the translitterated environment definition is necessary as XeTeX doesn't know how to hyphenate translitterated Russian.

Related Question