[Tex/LaTex] Using XeTeX for automatic transliteration of cyrillic letters

cyrillic font-encodings luatex xetex

This is a follow-up to the question: Serbian Cyrillic using LuaTeX and XeTeX.

I often need character substitution the other way around: I type Cyrillic but want transliterated output, e.g. I type добрый and get dobryj in the resulting document. This is very handy, and I use the following mappings with pdflatex to achieve it (I'm including them so people can reuse them):

\DeclareUnicodeCharacter{1040}{A}
\DeclareUnicodeCharacter{1041}{B}
\DeclareUnicodeCharacter{1042}{V}
\DeclareUnicodeCharacter{1043}{G}
\DeclareUnicodeCharacter{1044}{D}
\DeclareUnicodeCharacter{1045}{E}
\DeclareUnicodeCharacter{1046}{Ž}
\DeclareUnicodeCharacter{1047}{Z}
\DeclareUnicodeCharacter{1049}{J}
\DeclareUnicodeCharacter{1050}{K}
\DeclareUnicodeCharacter{1051}{L}
\DeclareUnicodeCharacter{1052}{M}
\DeclareUnicodeCharacter{1053}{N}
\DeclareUnicodeCharacter{1054}{O}
\DeclareUnicodeCharacter{1055}{P}
\DeclareUnicodeCharacter{1056}{R}
\DeclareUnicodeCharacter{1057}{S}
\DeclareUnicodeCharacter{1058}{T}
\DeclareUnicodeCharacter{1059}{U}
\DeclareUnicodeCharacter{1060}{F}
\DeclareUnicodeCharacter{1062}{C}
\DeclareUnicodeCharacter{1063}{Č}
\DeclareUnicodeCharacter{1064}{Š}
\DeclareUnicodeCharacter{1069}{Ė}
\DeclareUnicodeCharacter{1070}{Ju}
\DeclareUnicodeCharacter{1071}{Ja}
\DeclareUnicodeCharacter{1025}{Ë}
\DeclareUnicodeCharacter{1072}{a}
\DeclareUnicodeCharacter{1073}{b}
\DeclareUnicodeCharacter{1074}{v}
\DeclareUnicodeCharacter{1075}{g}
\DeclareUnicodeCharacter{1076}{d}
\DeclareUnicodeCharacter{1077}{e}
\DeclareUnicodeCharacter{1078}{ž}
\DeclareUnicodeCharacter{1079}{z}
\DeclareUnicodeCharacter{1080}{i}
\DeclareUnicodeCharacter{1081}{j}
\DeclareUnicodeCharacter{1082}{k}
\DeclareUnicodeCharacter{1083}{l}
\DeclareUnicodeCharacter{1084}{m}
\DeclareUnicodeCharacter{1085}{n}
\DeclareUnicodeCharacter{1086}{o}
\DeclareUnicodeCharacter{1087}{p}
\DeclareUnicodeCharacter{1088}{r}
\DeclareUnicodeCharacter{1089}{s}
\DeclareUnicodeCharacter{1090}{t}
\DeclareUnicodeCharacter{1091}{u}
\DeclareUnicodeCharacter{1092}{f}
\DeclareUnicodeCharacter{1094}{c}
\DeclareUnicodeCharacter{1095}{č}
\DeclareUnicodeCharacter{1096}{š}
\DeclareUnicodeCharacter{1101}{ė}
\DeclareUnicodeCharacter{1102}{ju}
\DeclareUnicodeCharacter{1103}{ja}
\DeclareUnicodeCharacter{1105}{ë}
\DeclareUnicodeCharacter{1110}{i}
\DeclareUnicodeCharacter{1030}{I}
\DeclareUnicodeCharacter{1108}{je}
\DeclareUnicodeCharacter{1028}{Je}
\DeclareUnicodeCharacter{1061}{X}
\DeclareUnicodeCharacter{1093}{x}
\DeclareUnicodeCharacter{1048}{I}
\DeclareUnicodeCharacter{1065}{ŠČ}
\DeclareUnicodeCharacter{1066}{'}
\DeclareUnicodeCharacter{1067}{Y}
\DeclareUnicodeCharacter{1068}{'}
\DeclareUnicodeCharacter{1097}{šč}
\DeclareUnicodeCharacter{1098}{'}
\DeclareUnicodeCharacter{1099}{y}
\DeclareUnicodeCharacter{1100}{'}

My question is: is there a straightforward way of reusing this very mapping in XeTeX?
I assume not, and that I need to input all the UTF-8 codes, right? But maybe somebody else has already done that. Is there any repository of mapping files?

Best Answer

The method is similar to the one used for Serbian. Prepare the following cyrillic-to-latin.map file:

; TECkit mapping for TeX input conventions <-> Unicode characters

LHSName "Cyrillic-to-Latin"
RHSName "UNICODE"

pass(Unicode)

; ligatures from Knuth's original CMR fonts
U+002D U+002D           <>  U+2013  ; -- -> en dash
U+002D U+002D U+002D    <>  U+2014  ; --- -> em dash

U+0027          <>  U+2019  ; ' -> right single quote
U+0027 U+0027   <>  U+201D  ; '' -> right double quote
U+0022           >  U+201D  ; " -> right double quote

U+0060          <>  U+2018  ; ` -> left single quote
U+0060 U+0060   <>  U+201C  ; `` -> left double quote

U+0021 U+0060   <>  U+00A1  ; !` -> inverted exclam
U+003F U+0060   <>  U+00BF  ; ?` -> inverted question

; additions supported in T1 encoding
U+002C U+002C   <>  U+201E  ; ,, -> DOUBLE LOW-9 QUOTATION MARK
U+003C U+003C   <>  U+00AB  ; << -> LEFT POINTING GUILLEMET
U+003E U+003E   <>  U+00BB  ; >> -> RIGHT POINTING GUILLEMET


U+0410 <> U+0041  ; A
U+0411 <> U+0042  ; B
U+0412 <> U+0056  ; V
U+0413 <> U+0047  ; G
U+0414 <> U+0044  ; D
U+0415 <> U+0045  ; E
U+0416 <> U+017D  ; Ž
U+0417 <> U+005A  ; Z
U+0419 <> U+004A  ; J
U+041A <> U+004B  ; K
U+041B <> U+004C  ; L
U+041C <> U+004D  ; M
U+041D <> U+004E  ; N
U+041E <> U+004F  ; O
U+041F <> U+0050  ; P
U+0420 <> U+0052  ; R
U+0421 <> U+0053  ; S
U+0422 <> U+0054  ; T
U+0423 <> U+0055  ; U
U+0424 <> U+0046  ; F
U+0426 <> U+0043  ; C
U+0427 <> U+010C  ; Č
U+0428 <> U+0160  ; Š
U+042D <> U+0116  ; Ė
U+042E <> U+004A U+0075  ; Ju
U+042F <> U+004A U+0061  ; Ja
U+0401 <> U+00CB  ; Ë
U+0430 <> U+0061  ; a
U+0431 <> U+0062  ; b
U+0432 <> U+0076  ; v
U+0433 <> U+0067  ; g
U+0434 <> U+0064  ; d
U+0435 <> U+0065  ; e
U+0436 <> U+017E  ; ž
U+0437 <> U+007A  ; z
U+0438 <> U+0069  ; i
U+0439 <> U+006A  ; j
U+043A <> U+006B  ; k
U+043B <> U+006C  ; l
U+043C <> U+006D  ; m
U+043D <> U+006E  ; n
U+043E <> U+006F  ; o
U+043F <> U+0070  ; p
U+0440 <> U+0072  ; r
U+0441 <> U+0073  ; s
U+0442 <> U+0074  ; t
U+0443 <> U+0075  ; u
U+0444 <> U+0066  ; f
U+0446 <> U+0063  ; c
U+0447 <> U+010D  ; č
U+0448 <> U+0161  ; š
U+044D <> U+0117  ; ė
U+044E <> U+006A U+0075  ; ju
U+044F <> U+006A U+0061  ; ja
U+0451 <> U+00EB  ; ë
U+0456 <> U+0069  ; i
U+0406 <> U+0049  ; I
U+0454 <> U+006A U+0065  ; je
U+0404 <> U+004A U+0065  ; Je
U+0425 <> U+0058  ; X
U+0445 <> U+0078  ; x
U+0418 <> U+0049  ; I
U+0429 <> U+0160  U+010C ; ŠČ
U+042A <> U+0027  ; '
U+042B <> U+0059  ; Y
U+042C <> U+2019  ; '
U+0449 <> U+0161  U+010D ; šč
U+044A <> U+2019  ; '
U+044B <> U+0079  ; y
U+044C <> U+2019  ; '

and run it through teckit_compile to produce the file cyrillic-to-latin.tec, which should be put in a place where XeTeX can find it. Then a document such as the following

\documentclass{article}
\usepackage{fontspec}
\setmainfont[Ligatures=TeX]{Linux Libertine O}
\usepackage{polyglossia}
\setmainlanguage{english}
\setotherlanguage{russian}
\newfontfamily{\transrussian}[Mapping=cyrillic-to-latin]{Linux Libertine O}

\newenvironment{translitterated}
  {\transrussian\hyphenrules{nohyphenation}\ignorespaces}
  {\ignorespacesafterend}

\begin{document}

\begin{russian}
Москва — столица Российской Федерации, город федерального значения,
административный центр Центрального федерального округа и центр
Московской области, в состав которой не входит. Крупнейший по
численности населения город России и Европы (население на 1 января
2012 года — 11 629 116 человек), по этому показателю входит в
десятку крупнейших городов мира. Центр Московской городской
агломерации.
\end{russian}

\begin{translitterated}
Москва — столица Российской Федерации, город федерального значения,
административный центр Центрального федерального округа и центр
Московской области, в состав которой не входит. Крупнейший по
численности населения город России и Европы (население на 1 января
2012 года — 11 629 116 человек), по этому показателю входит в
десятку крупнейших городов мира. Центр Московской городской
агломерации.
\end{translitterated}

\end{document}

will give a result similar to the following

[image: typeset output of the example document]

The nohyphenation setting in the translitterated environment definition is necessary because XeTeX doesn't know how to hyphenate transliterated Russian.
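
On the part of the question about retyping the code points: the numbers in the \DeclareUnicodeCharacter lines are decimal (1040 is U+0410), while the map file uses hexadecimal U+XXXX notation, so the conversion can be done mechanically. The following Python sketch is only an illustration and not part of the original answer; the function name decl_to_teckit and the regular expression are assumptions made for this example.

```python
import re

# Matches lines such as \DeclareUnicodeCharacter{1040}{A} (decimal code point).
DECL = re.compile(r"\\DeclareUnicodeCharacter\{(\d+)\}\{([^}]*)\}")


def decl_to_teckit(line):
    """Turn one decimal declaration line into a TECkit rule, or return None."""
    m = DECL.match(line.strip())
    if m is None:
        return None
    source = "U+%04X" % int(m.group(1))  # decimal -> four-digit hex
    # Every character of the replacement text becomes its own U+XXXX item.
    target = " ".join("U+%04X" % ord(ch) for ch in m.group(2))
    return "%s <> %s  ; %s" % (source, target, m.group(2))


if __name__ == "__main__":
    print(decl_to_teckit(r"\DeclareUnicodeCharacter{1040}{A}"))
    # U+0410 <> U+0041  ; A
    print(decl_to_teckit(r"\DeclareUnicodeCharacter{1102}{ju}"))
    # U+044E <> U+006A U+0075  ; ju
```

The generated lines still need a manual check, for instance for the quote characters, where the map above mostly uses U+2019 rather than the ASCII apostrophe.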

Related Solutions

[Tex/LaTex] Serbian Cyrillic using LuaTeX and XeTeX

Here is a method for XeLaTeX.

Prepare a file ascii-to-serbian.map with the following content:

; TECkit mapping for TeX input conventions <-> Unicode characters

LHSName "ASCII-to-Serbian"
RHSName "UNICODE"

pass(Unicode)

; ligatures from Knuth's original CMR fonts
U+002D U+002D           <>  U+2013  ; -- -> en dash
U+002D U+002D U+002D    <>  U+2014  ; --- -> em dash

U+0027          <>  U+2019  ; ' -> right single quote
U+0027 U+0027   <>  U+201D  ; '' -> right double quote
U+0022           >  U+201D  ; " -> right double quote

U+0060          <>  U+2018  ; ` -> left single quote
U+0060 U+0060   <>  U+201C  ; `` -> left double quote

U+0021 U+0060   <>  U+00A1  ; !` -> inverted exclam
U+003F U+0060   <>  U+00BF  ; ?` -> inverted question

; additions supported in T1 encoding
U+002C U+002C   <>  U+201E  ; ,, -> DOUBLE LOW-9 QUOTATION MARK
U+003C U+003C   <>  U+00AB  ; << -> LEFT POINTING GUILLEMET
U+003E U+003E   <>  U+00BB  ; >> -> RIGHT POINTING GUILLEMET

U+0041 <> U+0410 ; A
U+0042 <> U+0411 ; B
U+0043 <> U+0426 ; C
U+0043 U+0048 <> U+0427 ; CH
U+0043 U+0068 <> U+0427 ; Ch
U+0043 U+0031 <> U+040B ; C1
U+0027 U+0043 <> U+040B ; 'C
U+0044 <> U+0414 ; D
U+0044 U+004A <> U+0402 ; DJ
U+0044 U+006A <> U+0402 ; Dj
U+0044 U+005A U+0048 <> U+040F ; DZH
U+0044 U+007A U+0068 <> U+040F ; Dzh
U+0044 U+0031 <> U+040F ; D1
U+0045 <> U+0415 ; E
U+0046 <> U+0424 ; F
U+0047 <> U+0413 ; G
U+0048 <> U+0425 ; H
U+0049 <> U+0418 ; I
U+004A <> U+0408 ; J
U+004B <> U+041A ; K
U+004B U+0048 <> U+0425 ; KH
U+004B U+0068 <> U+0425 ; Kh
U+004C <> U+041B ; L
U+004C U+004A <> U+0409 ; LJ
U+004C U+006A <> U+0409 ; Lj
U+004D <> U+041C ; M
U+004E <> U+041D ; N
U+004E U+004A <> U+040A ; NJ
U+004E U+006A <> U+040A ; Nj
U+004F <> U+041E ; O
U+0050 <> U+041F ; P
;U+0051 <> ; Q
U+0052 <> U+0420 ; R
U+0053 <> U+0421 ; S
U+0053 U+0048 <> U+0428 ; SH
U+0053 U+0068 <> U+0428 ; Sh
U+0054 <> U+0422 ; T
U+0055 <> U+0423 ; U
U+0056 <> U+0412 ; V
;U+0057 <> ; W
U+0058 <> U+0425 ; X
;U+0059 ; Y
U+005A <> U+0417 ; Z
U+005A U+0048 <> U+0416 ; ZH
U+005A U+0068 <> U+0416 ; Zh

U+0061 <> U+0430 ; a
U+0062 <> U+0431 ; b
U+0063 <> U+0446 ; c
U+0063 U+0068 <> U+0447 ; ch
U+0063 U+0031 <> U+045B ; c1
U+0027 U+0063 <> U+045B ; 'c
U+0064 <> U+0434 ; d
U+0064 U+006A <> U+0452 ; dj
U+0064 U+007A U+0068 <> U+045F ; dzh
U+0064 U+0031 <> U+045F ; d1
U+0065 <> U+0435 ; e
U+0066 <> U+0444 ; f
U+0067 <> U+0433 ; g
U+0068 <> U+0445 ; h
U+0069 <> U+0438 ; i
U+006A <> U+0458 ; j
U+006B <> U+043A ; k
U+006B U+0068 <> U+0445 ; kh
U+006C <> U+043B ; l
U+006C U+006A <> U+0459 ; lj
U+006D <> U+043C ; m
U+006E <> U+043D ; n
U+006E U+006A <> U+045A ; nj
U+006F <> U+043E ; o
U+0070 <> U+043F ; p
;U+0071 <> ; q
U+0072 <> U+0440 ; r
U+0073 <> U+0441 ; s
U+0073 U+0068 <> U+0448 ; sh
U+0074 <> U+0442 ; t
U+0075 <> U+0443 ; u
U+0076 <> U+0432 ; v
;U+0077 <> ; w
U+0078 <> U+0445 ; x
;U+0079 ; y
U+007A <> U+0437 ; z
U+007A U+0068 <> U+0436 ; zh

; Additional (for official transliteration)
U+0110 <> U+0402 ; Đ
U+0111 <> U+0452 ; đ
U+017D <> U+0416 ; Ž
U+017E <> U+0436 ; ž
U+0106 <> U+040B ; Ć
U+0107 <> U+045B ; ć
U+010C <> U+0427 ; Č
U+010D <> U+0447 ; č
U+0044 U+017D <> U+040F ; DŽ
U+0044 U+017E <> U+040F ; Dž
U+0064 U+017E <> U+045F ; dž
U+0160 <> U+0428 ; Š
U+0161 <> U+0448 ; š

Then process it with

teckit_compile ascii-to-serbian.map

This will produce a file ascii-to-serbian.tec that you can put anywhere XeTeX will find it (in the working directory, for instance). Then make the following test file:

\documentclass{article}
\usepackage{fontspec}
\setmainfont[Ligatures=TeX]{Linux Libertine O}
\newfontfamily{\serbianfont}[Mapping=ascii-to-serbian]{Linux Libertine O}
\usepackage{polyglossia}
\setmainlanguage{english}
\setotherlanguage[Script=Cyrillic]{serbian}

\begin{document}
Serbian alphabet again

\begin{serbian}
A B V G D DJ E Zh Z I J K L LJ M N NJ O P R S T C1 U F Kh C Ch D1 Sh

a b v g d dj e zh z i j k l lj m n nj o p r s t c1 u f kh c ch d1 sh
\end{serbian} 
\end{document}

Sample output after xelatex test.tex

[image: typeset output of the test file]

Note 1: the characters Џ and џ can also be input as DZH (or Dzh) and dzh. If this is incorrect (it might lead to incorrect ligatures), then remove the corresponding lines from ascii-to-serbian.map.
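
To see why the DZH rule in Note 1 can matter, here is a small Python simulation. It assumes that the longest matching rule wins at each position (an assumption about TECkit's behaviour made for this sketch); the rule subset, the function name apply_rules and the test word are chosen for illustration only.

```python
# A few rules copied from ascii-to-serbian.map (Latin input -> Cyrillic output).
RULES = {
    "d": "д", "z": "з", "zh": "ж", "dzh": "џ",
    "n": "н", "a": "а", "i": "и", "v": "в", "e": "е", "t": "т",
}


def apply_rules(text, rules):
    """Greedy longest-match replacement; unmatched characters pass through."""
    longest = max(len(key) for key in rules)
    out, pos = [], 0
    while pos < len(text):
        for size in range(longest, 0, -1):
            chunk = text[pos:pos + size]
            if chunk in rules:
                out.append(rules[chunk])
                pos += len(chunk)
                break
        else:
            # No rule starts here: copy the character unchanged.
            out.append(text[pos])
            pos += 1
    return "".join(out)


if __name__ == "__main__":
    print(apply_rules("nadzhiveti", RULES))   # the dzh rule swallows d + zh
    without_dzh = {k: v for k, v in RULES.items() if k != "dzh"}
    print(apply_rules("nadzhiveti", without_dzh))
```

With the dzh rule present, a d followed by zh is always merged into џ; removing the rule keeps д and ж separate, at the cost of having to type d1 for џ.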

Note 2: if you find it inconvenient to type C1 and c1 to get Ћ and ћ, you can add the lines

U+0027 U+0043 <> U+040B ; 'C

and

U+0027 U+0063 <> U+045B ; 'c

after the C1 and c1 entries. This will allow you to input the characters as 'C and 'c.

If you want to input them as \'C and \'c, then insert this code after having loaded the Serbian language with Polyglossia

\let\standardcommandquote\'
\DeclareRobustCommand{\serbiancommandquote}[1]{%
  \ifnum\strcmp{#1}{c}=0 c1\else
    \ifnum\strcmp{#1}{C}=0 C1\else
      \standardcommandquote{#1}\fi\fi}
\makeatletter
\appto\blockextras@serbian{\let\'\serbiancommandquote}
\appto\inlineextras@serbian{\let\'\serbiancommandquote}
\appto\noextras@serbian{\let\'\standardcommandquote}
\makeatother

Note 3 (added Feb. 17): If one has available Unicode input, then also

Đ đ Ž ž Ć ć Č č DŽ Dž dž Š š

are mapped to

Ђ ђ Ж ж Ћ ћ Ч ч Џ Џ џ Ш ш

respectively.

[Tex/LaTex] Create a mapping for transliteration from cyrillic to latin in LuaLaTeX

Edit 10/2017

Here is a new version of the Lua script. It generates code for Luaotfload instead of a font feature file, which is no longer supported. It also has a new option, --back, which creates the mapping in the opposite direction, such as from Cyrillic to Latin in our example.
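
The two decisions made for each rule, namely which side is looked up (swapped by --back) and which feature table the rule lands in, can be sketched independently of LuaTeX. The following Python sketch is an illustration only, not the script itself; parse_rule is a hypothetical helper name, and the classification mirrors the liga/ccmp/gsub split used in the Lua script below.

```python
import re

CODEPOINT = re.compile(r"U\+([0-9A-Fa-f]+)")


def parse_rule(line, back=False):
    """Return (kind, lookup, replacement) for a '<>' rule, else None."""
    body = line.split(";", 1)[0]  # drop the trailing comment
    if "<>" not in body:
        return None
    left, right = body.split("<>", 1)
    lookup = [chr(int(h, 16)) for h in CODEPOINT.findall(left)]
    replacement = [chr(int(h, 16)) for h in CODEPOINT.findall(right)]
    if not lookup or not replacement:
        return None
    if back:
        # Reverse direction: what was produced is now what is looked up.
        lookup, replacement = replacement, lookup
    if len(lookup) > 1:
        kind = "liga"   # several characters in, one out
    elif len(replacement) > 1:
        kind = "ccmp"   # one character in, several out
    else:
        kind = "gsub"   # one-to-one substitution
    return kind, "".join(lookup), "".join(replacement)


if __name__ == "__main__":
    print(parse_rule("U+0429 <> U+0160  U+010C ; ŠČ"))
    print(parse_rule("U+0429 <> U+0160  U+010C ; ŠČ", back=True))
```

Note how the same rule is a one-to-many rule in the forward direction and a ligature-type rule once the sides are swapped.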

The script is named maptolua.lua:

kpse.set_program_name "luatex"
local lapp = require "lapp-mk4"
local uchar = unicode.utf8.char
local args

local function load_glyph_list(filename)
  local t = {}
  for line in io.lines(filename) do
    local glyph, code = line:match("([^;]+);([A-Fa-f0-9]+)")
    if glyph then
      code = string.upper(code)
      -- print(code, glyph)
      t[code] = glyph
    end
  end
  return t
end

local function load_map_file(mapfile, glyph_list)
  local glyph_list = glyph_list or {}
  local parse_codepoints = function(s)
    local t = {}
    local s = string.upper(s)
    for x in s:gmatch("U%+([0-9A-F]+)") do
      t[#t+1] = glyph_list[x] or "undefined"
    end
    return t
  end
  local get_chars = function(s)
    local t = {}
    for x in s:gmatch("U%+([0-9A-F]+)") do
      t[#t+1] = string.format('"%s"', uchar(tonumber(x, 16)))
    end
    return t
  end
  local t = {liga = {}, gsub = {}, ccmp= {}}
  for line in io.lines(mapfile) do
    -- search for 
    local lookup, replace = line:match("([^%<]+)<>([^%;]+)")
    if args.back then -- we can create reverse mapping from the map file
      lookup, replace = replace, lookup
    end
    -- process lines which define mappings
    if lookup then
      -- convert strings with unicode codepoints to tables with glyph names
      -- local lookups = parse_codepoints(lookup)
      local lookups = get_chars(lookup)
      -- local replaces = parse_codepoints(replace)
      local replaces = get_chars(replace)
      -- print(table.concat(lookups, ";"), "+", table.concat(replaces, ";"))
      local newt = {lookups = lookups, replaces = replaces}
      if #lookups > 1 then
        table.insert(t.liga, newt)
      elseif #replaces > 1 then
        table.insert(t.ccmp, newt)
      else 
        table.insert(t.gsub, newt)
      end
    end
  end
  return t
end

local function print_fea_file(script, language,  maptable)
  local function print_feature(feature, name, typ, key, value) 
    -- the field must be only one character long
    print("fonts.handlers.otf.addfeature {")
    print(string.format("\tname='%s',", name))
    print(string.format("\ttype='%s',", typ))
    print("\tdata={")
    for _, entry in ipairs(maptable[feature]) do
      local field = entry[key][1]
      local result = entry[value]
      if #result > 1 then
        local t = {}
        for _, s in ipairs(result) do
          t[#t+1] = string.format("%s", s)
        end
        result = "{" .. table.concat(t, ",") .. "}"
      else
        result = string.format("%s", result[1])
      end
      print(string.format("[%s] = %s,", field, result))
    end
    print("}}")
  end
  -- print(string.format("languagesystem %s %s;", script, language))
  print("\\directlua{")
  print_feature("liga", "liga", "ligature", "replaces", "lookups")
  print_feature("ccmp", "ccmp", "multiple", "lookups", "replaces")
  print_feature("gsub", "gsub", "substitution","lookups", "replaces")
  print "}"
end

args = lapp [[
maptolua.lua Convert teckit map files to Luaotfload feature tables
Usage:
texlua maptolua.lua [options] <map file> [glyph list file]
-l,--language  (default dflt) language name in OpenType format
-s,--script  (default LATN) script name in OpenType format
-b,--back  create back mapping
<map_file> (string) file to be converted
[glyph_list] (default glyphlist.txt) file in Adobe glyph list format with unicode to glyph names mapping
]]

-- if not arg[1] then
--   print "Usage:"
--   print "texlua maptofea.lua mapfile [glyph list] > featurefile.fea"
--   os.exit()
-- end

-- map files use Unicode values, we need to transform them to the glyph names
-- the glyph list can either be passed as the second argument, or the one shipped in TL is used
local glyphfile = args.glyph_list or kpse.find_file("glyphlist.txt", "map")

local glyphtable = load_glyph_list(glyphfile)

-- load the map file, search for unicode values and replace them with glyph names
local maptable = load_map_file(args.map_file, glyphtable)

print_fea_file(args.script, args.language, maptable)

It can be executed like this:

texlua maptolua.lua  cyr.map  > newfeat.tex

which produces the following TeX file:

\directlua{
fonts.handlers.otf.addfeature {
    name='liga',
    type='ligature',
    data={
["–"] = {"-","-"},
["—"] = {"-","-","-"},
["”"] = {"'","'"},
["“"] = {"`","`"},
["¡"] = {"!","`"},
["¿"] = {"?","`"},
["„"] = {",",","},
["«"] = {"<","<"},
["»"] = {">",">"},
}}
fonts.handlers.otf.addfeature {
    name='ccmp',
    type='multiple',
    data={
["Ю"] = {"J","u"},
["Я"] = {"J","a"},
["ю"] = {"j","u"},
["я"] = {"j","a"},
["є"] = {"j","e"},
["Є"] = {"J","e"},
["Щ"] = {"Š","Č"},
["щ"] = {"š","č"},
}}
fonts.handlers.otf.addfeature {
    name='gsub',
    type='substitution',
    data={
["'"] = "’",
["`"] = "‘",
["А"] = "A",
["Б"] = "B",
["В"] = "V",
["Г"] = "G",
["Д"] = "D",
["Е"] = "E",
["Ж"] = "Ž",
["З"] = "Z",
["Й"] = "J",
["К"] = "K",
["Л"] = "L",
["М"] = "M",
["Н"] = "N",
["О"] = "O",
["П"] = "P",
["Р"] = "R",
["С"] = "S",
["Т"] = "T",
["У"] = "U",
["Ф"] = "F",
["Ц"] = "C",
["Ч"] = "Č",
["Ш"] = "Š",
["Э"] = "Ė",
["Ё"] = "Ë",
["а"] = "a",
["б"] = "b",
["в"] = "v",
["г"] = "g",
["д"] = "d",
["е"] = "e",
["ж"] = "ž",
["з"] = "z",
["и"] = "i",
["й"] = "j",
["к"] = "k",
["л"] = "l",
["м"] = "m",
["н"] = "n",
["о"] = "o",
["п"] = "p",
["р"] = "r",
["с"] = "s",
["т"] = "t",
["у"] = "u",
["ф"] = "f",
["ц"] = "c",
["ч"] = "č",
["ш"] = "š",
["э"] = "ė",
["ё"] = "ë",
["і"] = "i",
["І"] = "I",
["Х"] = "X",
["х"] = "x",
["И"] = "I",
["Ъ"] = "'",
["Ы"] = "Y",
["Ь"] = "’",
["ъ"] = "’",
["ы"] = "y",
["ь"] = "’",
}}
}

It can be used in the following way:

\documentclass{article}
\usepackage{fontspec}
\usepackage{ifluatex,ifxetex}
\input{newfeat.tex}
\setmainfont[RawFeature=+gsub;]{Linux Libertine O}


\begin{document}
\ifxetex
This is XeTeX
\else\ifluatex
This is LuaTeX
\fi\fi


Hello,, -- --- world я щ 
Здравствуй, Мир

\end{document}

(Note that it is necessary to use RawFeature=+gsub; in the font declaration)

And this is the result: [image not preserved]

Edit: added support for replacing one glyph with multiple new ones.

LuaTeX doesn't support mapping files, but it does support OpenType feature files. There is a major difference between the two: mapping files work at the character level with Unicode values, whereas feature files work with glyph names.
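
The step from Unicode values to glyph names is a plain table lookup in a glyph list, where each line has the form name;code. As a rough Python sketch (the sample lines, their order and the function names are assumptions for illustration; the real glyphlist.txt is much longer):

```python
# A tiny excerpt in glyph list format: glyph name, semicolon, hex code point.
SAMPLE_GLYPHLIST = """\
# comment lines start with '#'
Acyrillic;0410
afii10017;0410
afii10066;0431
becyrillic;0431
"""


def load_glyph_list(text):
    """Map upper-case hex code points to glyph names."""
    names = {}
    for line in text.splitlines():
        if line.startswith("#") or ";" not in line:
            continue
        name, code = line.split(";", 1)
        # A code point may have several names; the one read last wins.
        names[code.strip().upper()] = name
    return names


def glyph_name(codepoint, names):
    """Look up a glyph name, falling back to 'undefined'."""
    return names.get("%04X" % codepoint, "undefined")


if __name__ == "__main__":
    names = load_glyph_list(SAMPLE_GLYPHLIST)
    print(glyph_name(0x0410, names))  # afii10017
    print(glyph_name(0x0431, names))  # becyrillic
```

Because a later entry overwrites an earlier one, the converted file can contain a mix of afii-style and descriptive names for otherwise similar characters.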

I've created a simple script for converting the map files to .fea files, maptofea.lua:

kpse.set_program_name "luatex"
local lapp = require "lapp-mk4"

local function load_glyph_list(filename)
  local t = {}
  for line in io.lines(filename) do
    local glyph, code = line:match("([^;]+);([A-Fa-f0-9]+)")
    if glyph then
      code = string.upper(code)
      -- print(code, glyph)
      t[code] = glyph
    end
  end
  return t
end

local function load_map_file(mapfile, glyph_list)
  local glyph_list = glyph_list or {}
  local parse_codepoints = function(s)
    local t = {}
    local s = string.upper(s)
    for x in s:gmatch("U%+([0-9A-F]+)") do
      t[#t+1] = glyph_list[x] or "undefined"
    end
    return t
  end
  local t = {liga = {}, gsub = {}, ccmp= {}}
  for line in io.lines(mapfile) do
    -- search for 
    local lookup, replace = line:match("([^%<]+)<>([^%;]+)")
    -- process lines which define mappings
    if lookup then
      -- convert strings with unicode codepoints to tables with glyph names
      local lookups = parse_codepoints(lookup)
      local replaces = parse_codepoints(replace)
      -- print(table.concat(lookups, ";"), "+", table.concat(replaces, ";"))
      local newt = {lookups = lookups, replaces = replaces}
      if #lookups > 1 then
        table.insert(t.liga, newt)
      elseif #replaces > 1 then
        table.insert(t.ccmp, newt)
      else 
        table.insert(t.gsub, newt)
      end
    end
  end
  return t
end

local function print_fea_file(script, language,  maptable)
  local function print_feature(feature) 
    print("feature " .. feature .. " {")
    for _, entry in ipairs(maptable[feature]) do
      print(string.format("  sub %s by %s;", table.concat(entry.lookups, " "), table.concat(entry.replaces, " ")))
    end
    print("} ".. feature .. ";")
  end
  print(string.format("languagesystem %s %s;", script, language))
  print_feature "liga"
  print_feature "ccmp"
  print_feature "gsub"
end

local args = lapp [[
maptofea.lua Convert teckit map files to OpenType feature files
Usage:
texlua maptofea.lua [options] <map file> [glyph list file]
-l,--language  (default dflt) language name in OpenType format
-s,--script  (default LATN) script name in OpenType format
<map_file> (string) file to be converted
[glyph_list] (default glyphlist.txt) file in Adobe glyph list format with unicode to glyph names mapping
]]

-- if not arg[1] then
--   print "Usage:"
--   print "texlua maptofea.lua mapfile [glyph list] > featurefile.fea"
--   os.exit()
-- end

-- map files use Unicode values, we need to transform them to the glyph names
-- the glyph list can either be passed as the second argument, or the one shipped in TL is used
local glyphfile = arg[2] or kpse.find_file("glyphlist.txt", "map")

local glyphtable = load_glyph_list(glyphfile)

-- load the map file, search for unicode values and replace them with glyph names
-- use the parsed <map_file> argument so that options may precede the file name
local maptable = load_map_file(args.map_file, glyphtable)

print_fea_file(args.script, args.language, maptable)

Its help message:

maptofea.lua Convert teckit map files to OpenType feature files
Usage:
texlua maptofea.lua [options] <map file> [glyph list file]
-l,--language  (default dflt) language name in OpenType format
-s,--script  (default LATN) script name in OpenType format
<map_file> (string) file to be converted
[glyph_list] (default glyphlist.txt) file in Adobe glyph list format with unicode to glyph names mapping

You can simply use it without any options on a map file:

texlua maptofea.lua cyrillic-to-latin.map > cyrtolatn2.fea

The converted file cyrtolatn2.fea:

languagesystem LATN dflt;
feature liga {
  sub hyphen hyphen by endash;
  sub hyphen hyphen hyphen by emdash;
  sub quotesingle quotesingle by quotedblright;
  sub grave grave by quotedblleft;
  sub exclam grave by exclamdown;
  sub question grave by questiondown;
  sub comma comma by quotedblbase;
  sub less less by guillemotleft;
  sub greater greater by guillemotright;
} liga;
feature ccmp {
  sub afii10048 by J u;
  sub afii10049 by J a;
  sub iucyrillic by j u;
  sub iacyrillic by j a;
  sub ecyrillic by j e;
  sub afii10053 by J e;
  sub afii10043 by Scaron Ccaron;
  sub shchacyrillic by scaron ccaron;
} ccmp;
feature gsub {
  sub quotesingle by quoteright;
  sub grave by quoteleft;
  sub afii10017 by A;
  sub afii10018 by B;
  sub afii10019 by V;
  sub afii10020 by G;
  sub afii10021 by D;
  sub afii10022 by E;
  sub afii10024 by Zcaron;
  sub afii10025 by Z;
  sub afii10027 by J;
  sub afii10028 by K;
  sub afii10029 by L;
  sub afii10030 by M;
  sub afii10031 by N;
  sub afii10032 by O;
  sub afii10033 by P;
  sub afii10034 by R;
  sub afii10035 by S;
  sub afii10036 by T;
  sub afii10037 by U;
  sub afii10038 by F;
  sub afii10040 by C;
  sub afii10041 by Ccaron;
  sub afii10042 by Scaron;
  sub afii10047 by Edotaccent;
  sub afii10023 by Edieresis;
  sub afii10065 by a;
  sub becyrillic by b;
  sub vecyrillic by v;
  sub gecyrillic by g;
  sub decyrillic by d;
  sub iecyrillic by e;
  sub zhecyrillic by zcaron;
  sub zecyrillic by z;
  sub iicyrillic by i;
  sub iishortcyrillic by j;
  sub kacyrillic by k;
  sub elcyrillic by l;
  sub emcyrillic by m;
  sub encyrillic by n;
  sub ocyrillic by o;
  sub pecyrillic by p;
  sub ercyrillic by r;
  sub escyrillic by s;
  sub tecyrillic by t;
  sub ucyrillic by u;
  sub efcyrillic by f;
  sub tsecyrillic by c;
  sub checyrillic by ccaron;
  sub shacyrillic by scaron;
  sub ereversedcyrillic by edotaccent;
  sub iocyrillic by edieresis;
  sub icyrillic by i;
  sub afii10055 by I;
  sub afii10039 by X;
  sub khacyrillic by x;
  sub afii10026 by I;
  sub afii10044 by quotesingle;
  sub afii10045 by Y;
  sub afii10046 by quoteright;
  sub hardsigncyrillic by quoteright;
  sub yericyrillic by y;
  sub softsigncyrillic by quoteright;
} gsub;

You have to request the feature file and also the gsub OpenType feature in the document:

\documentclass{article}
\usepackage{fontspec}
\usepackage{ifluatex,ifxetex}

\setmainfont[Mapping=cyrillic-to-latin,FeatureFile=cyrtolatn2.fea, RawFeature={+gsub;+liga;}]{Linux Libertine O}

\begin{document}
\ifxetex
    This is XeTeX
\else\ifluatex
    This is LuaTeX
\fi\fi

Hello,, -- ---  world я щ 
Здравствуй, Мир

\end{document}

and this is the result: [image not preserved]

Update for the TeX Live 2016 distribution.

The new release does not support the inclusion of a .fea file, which makes this method obsolete. A workaround is to use \directlua as follows:

\directlua{
    fonts.handlers.otf.addfeature {
        name = "myliga",
         type = "ligature",
         data = {
             ['Aacute'] = { "А", 0x0301},
             ['Eacute'] = { "Е", 0x0301},
             ['Iacute'] = { "И", 0x0301},
             ['iacute'] = { "и", 0x0301},
             ['Oacute'] = { "О", 0x0301},
             ['Uacute'] = { "У", 0x0301},
             ['Yacute'] = { "Ы", 0x0301},         
             ['Egrave'] = { "Э", 0x0301},         
             ['egrave'] = { "э", 0x0301},         
        },
    }
}

\directlua{
    fonts.handlers.otf.addfeature {
        name = "mycomp",
         type = "multiple",
         data = {
             afii10039 = { "C", "h" },
             afii10087 = { "c", "h" },
             afii10048 = { "J", "u" },
             afii10049 = { "J", "a" },
             afii10096 = { "j", "u" },
             afii10097 = { "j", "a" },
             afii10053 = { "J", "e" },
             afii10043 = { "Scaron", "ccaron" },
             afii10091 = { "scaron", "ccaron" },
        },
    }
}

\directlua{
    fonts.handlers.otf.addfeature {
        name = "mysub",
            type = "substitution",
            data = {
                ["quotesingle"] = "quoteright",
                ["grave"] = "quoteleft",
                ["afii10017"] = "A",
                ["afii10018"] = "B",
                ["afii10019"] = "V",
                ["afii10020"] = "G",
                ["afii10021"] = "D",
                ["afii10022"] = "E",
                ["afii10024"] = "Zcaron",
                ["afii10025"] = "Z",
                ["afii10026"] = "I",
                ["afii10027"] = "J",
                ["afii10028"] = "K",
                ["afii10029"] = "L",
                ["afii10030"] = "M",
                ["afii10031"] = "N",
                ["afii10032"] = "O",
                ["afii10033"] = "P",
                ["afii10034"] = "R",
                ["afii10035"] = "S",
                ["afii10036"] = "T",
                ["afii10037"] = "U",
                ["afii10038"] = "F",
                ["afii10040"] = "C",
                ["afii10041"] = "Ccaron",
                ["afii10042"] = "Scaron",
                ["afii10047"] = "Edotaccent",
                ["afii10023"] = "Edieresis",
                ["afii10065"] = "a",
                ["afii10066"] = "b",
                ["afii10067"] = "v",
                ["afii10068"] = "g",
                ["afii10069"] = "d",
                ["afii10070"] = "e",
                ["afii10072"] = "zcaron",
                ["afii10073"] = "z",
                ["afii10074"] = "i",
                ["afii10075"] = "j",
                ["afii10076"] = "k",
                ["afii10077"] = "l",
                ["afii10078"] = "m",
                ["afii10079"] = "n",
                ["afii10080"] = "o",
                ["afii10081"] = "p",
                ["afii10082"] = "r",
                ["afii10083"] = "s",
                ["afii10084"] = "t",
                ["afii10085"] = "u",
                ["afii10086"] = "f",
                ["afii10088"] = "c",
                ["afii10089"] = "ccaron",
                ["afii10090"] = "scaron",
                ["afii10095"] = "edotaccent",
                ["afii10071"] = "edieresis",
                ["afii10103"] = "i",
                ["afii10055"] = "I",
                ["afii10026"] = "I",
                ["afii10044"] = "quoteright",
                ["afii10045"] = "Y",
                ["afii10046"] = "quoteright",
                ["afii10092"] = "quoteright",
                ["afii10093"] = "y",
                ["afii10094"] = "quoteright",
        },
    }
}

The features defined above must be enabled in the usual way:

\setmainfont{Linux Libertine O}[RawFeature={+mysub;+mycomp;+myliga}]
