[TeX/LaTeX] Complex ligatures in Devanāgarī

fonts, fontspec, ligatures, xetex

Ligatures are familiar from Western European languages, but vertical or otherwise complex forms of them are rare there. As an exercise in typography, I tried to get some complex ligatures in the Devanāgarī script.

I downloaded the chandas.ttf font (Southern style; uttara.ttf would serve equally well as a test case), which contains complex ligatures, as well as the Sanskrit 2003 font, which does not contain them and so shows the common form of writing.

I opened the font in FontForge, where we can see those ligatures and their names. The ligatures are properly mapped, so we can use them. I only changed some letters in the names to fit transliteration schemes (Y as nj or J, G as ng), and I am using Lexilogos and Sanscript to get portions of words.

a preview of complex ligatures in FontForge

Note: In case you are wondering what I am trying to achieve: I am trying to convert transliterated words sorted in Xindy (Sanskrit, Pāḷi, hopefully even Tamiḻ and Siṇhala later), and I am checking what options I have for displaying index entries. As font mapping is not supported in LuaTeX, I am trying to prepare standalone Lua scripts which will replace the Latin letters.

In theory, I have two problems now:

1) How can I turn off those complex ligatures locally in the document and get the regular form of writing, if needed? To preview the common form I used a different font in the example below. Running otfinfo -f chandas.ttf shows that the font has three features, but turning them off does not help.

2) How can I get complex ligatures in LuaLaTeX? As far as I know, its support for Indic languages is very limited. Under normal circumstances I use \char to get a specific glyph, but those complex ligatures are not mapped to Unicode code points in the private use areas (PUA). I could use \XeTeXglyph with the glyph's slot, but that is not easy either: FontForge shows 3417 (0x0D59) for the DNjYa glyph, but the actual position obtained from XeTeX (\the\XeTeXglyphindex"DNjYa") is 2510 (0x09CE). What a day!
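
One hedged way to avoid hard-coding slot numbers under XeLaTeX is to let the engine look the glyph up by name each time. The sketch below assumes that chandas.ttf is installed and really contains a glyph named DNjYa, as reported above; \glyphbyname is a hypothetical helper, not part of any package. The mismatch between 3417 and 2510 may simply mean that FontForge displays a position in its current encoding view, whereas \XeTeXglyph expects the font's internal glyph index, but that is an assumption and has not been verified against this font.

```latex
% Sketch only: compile with xelatex; chandas.ttf is assumed to be installed.
\documentclass{article}
\usepackage{fontspec}
\setmainfont[Script=Devanagari]{chandas.ttf}
% Hypothetical helper: typeset a glyph of the current font by its glyph name.
% \XeTeXglyphindex"name" yields the glyph index; the trailing space ends the number.
\newcommand*\glyphbyname[1]{\XeTeXglyph\XeTeXglyphindex"#1" }
\begin{document}
\glyphbyname{DNjYa}\par
% Print the index that XeTeX reports for the same name.
Index of DNjYa: \the\XeTeXglyphindex"DNjYa" \par
\end{document}
```

Because the lookup happens at typesetting time, the helper would keep working if a later version of the font reorders its glyphs, as long as the glyph name stays the same.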

A bonus: there is even more fun if we try to put two complex ligatures next to each other (I cannot say whether it is linguistically correct; almost certainly it is not), e.g. DNjJ+DNjJa. The middle letters form a ligature first while typing. The solution is to use \mbox{}; see the last line in the example.
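
As an alternative to \mbox{}, Unicode defines two joiner controls for Indic scripts: ZERO WIDTH NON-JOINER (U+200C) after a virama requests the explicit virama form, and ZERO WIDTH JOINER (U+200D) after a virama requests a half form where the font provides one. The following is only a sketch, assuming XeLaTeX and an installed chandas.ttf; \devzwnj and \devzwj are hypothetical helper names, and how this particular font responds has not been tested here.

```latex
% Sketch only: compile with xelatex; chandas.ttf is assumed to be installed.
\documentclass{article}
\usepackage{fontspec}
\setmainfont[Script=Devanagari]{chandas.ttf}
% Hypothetical helper names; ^^^^200c and ^^^^200d are TeX notation for U+200C and U+200D.
\newcommand*\devzwnj{^^^^200c}% ZERO WIDTH NON-JOINER
\newcommand*\devzwj{^^^^200d}% ZERO WIDTH JOINER
\begin{document}
ड्ण्ज्ञ (default shaping)\par
ड्\devzwnj ण्\devzwnj ज्ञ (ZWNJ after each virama)\par
ड्\devzwj ण्\devzwj ज्ञ (ZWJ after each virama)\par
\end{document}
```

Unlike \mbox{}, the joiners stay inside the same run of text, so the shaping engine still sees the whole cluster and can choose the requested forms.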

I enclose an example and a preview of my efforts. The example can be run with either xelatex or lualatex. If you are interested, an encoding table can be obtained from http://www.sanskritweb.net/cakram/chandas-encoding.pdf and a preview of all ligatures from http://www.sanskritweb.net/cakram/saMyoga-pattra.pdf

% run: xelatex or lualatex mal-sanskrit.tex
\documentclass[a4paper]{article}
\pagestyle{empty}
\usepackage{ifxetex}
\usepackage{fontspec}
% Possible addition for LuaLaTeX (1 line):
%\usepackage{luatextra}
% Possible addition for XeLaTeX (3 lines):
%\usepackage{polyglossia}
%\setmainlanguage{sanskrit}
%\newfontfamily{\devanagarifont}{Sanskrit2003}
\parindent=0pt
\begin{document}
Correct form in Xe\LaTeX, incorrect in Lua\LaTeX\ (2 lines):\par
\setmainfont[Script=Devanagari]{Sanskrit2003.ttf}
ड्ण्ज्ञ​ (common form; using different font [Sanskrit2003.ttf])\par % ड्ण्ज्ञ
\setmainfont[Script=Devanagari]{chandas.ttf}
ड्ण्ज्ञ​ (glyph DNjYa; DNjnja [Lexilogos] or DNjJa [Harvard-Kyoto])\par\medskip
  % Y as nj or J, G as ng
\setmainfont[Script=Devanagari,RawFeature=-liga;-mkmk;-mark]{chandas.ttf}
ड्ण्ज्ञ​ (a form of ligature; Devanāgarī script)\par
\setmainfont[Script=Latin]{chandas.ttf}
ड्ण्ज्ञ​ (incorrect form; Latin script)\par
\setmainfont[Script=Devanagari]{chandas.ttf}
ड्\mbox{}ण्\mbox{}ज्ञ​ (incorrect form; separated glyphs, \verb.\mbox.)\par\medskip
\ifxetex
3417 as \XeTeXglyph"0D59\ versus % 3417
\the\XeTeXglyphindex"DNjYa"\ as % 2510
\XeTeXglyph2510\ or\ \XeTeXglyph"09CE\ (getting slot number)\par
ड्ण्ज्ञ्ड्ण्ज्ञ / ड्ण्ज्ञ्\mbox{}ड्ण्ज्ञ (DNjJ DNjJa)
\fi
\end{document}

output from xelatex

output from lualatex

Best Answer

It may be that the differing meanings of the term 'ligature' are the driver behind the question.


This is a comment with images, so not an answer. The original post (also not an answer, more of an observation) is kept down below, for continuity.

Assuming the question is about indexing transliterated content, there is a method that involves no decomposition of the displayed material.

A string of glyphs is given to a renderer, and the renderer displays it appropriately. The string of glyphs remains available, though, so reverting the display back into its input is not required.

For example, to explore the structure of the orthography, the string संयुक्त व्यंजन, which is what Google returns when "conjunct consonants in Hindi" is entered (and which Google transliterates the pronunciation of as "sanyukt vyanjan"), can be transliterated on a glyph-by-glyph basis as:

saṇya̺uka̺ta va̺yaṇjana

glyphs transliteration

using a mapping methodology whereby the inherent 'a' vowel attached to each consonant is shown, and also shown where it is switched off by the orthography rules:

  • before another vowel
  • in between two consonants where it is not needed
  • at the end of a word

Here, arbitrarily, the inverted under-bridge combining diacritical mark is being used (via a font mapping file) as a visual representation of the switched-off vowel in the first two cases.

The \index command can then take this transliteration string, like any other string, and do its usual work:

indexing example

Code

\documentclass[12pt]{article}
\usepackage{xcolor}
\usepackage{fontspec}
\setmainfont[Script=Devanagari]{Noto Serif Devanagari}
\newfontface\translitd[Mapping=devanagari-to-latin,Scale=1.1,Colour=red]{Noto Sans}
\newfontfamily\englishfont{Noto Serif}
\usepackage{polyglossia}
\setdefaultlanguage{hindi}
\setotherlanguages{english}
\usepackage{imakeidx}
\makeindex

\begin{document}
\Large
संयुक्त व्यंजन
{\normalsize\textenglish{sanyukt vyanjan}}

{\translitd संयुक्त}\index{{\translitd संयुक्त}}
{\translitd व्यंजन}\index{{\translitd व्यंजन}}


\printindex
\end{document}
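
If the index should be sorted by a plain Latin key while still displaying the Devanāgarī form, makeindex's standard key@display syntax is one option. The sketch below is a variation on the example above, not the method used there; it assumes that Noto Serif Devanagari is installed and that imakeidx is allowed to run makeindex. The ASCII keys are simply the transliteration quoted earlier.

```latex
% Sketch only: compile with xelatex; Noto Serif Devanagari is assumed to be installed.
\documentclass[12pt]{article}
\usepackage{fontspec}
\setmainfont[Script=Devanagari]{Noto Serif Devanagari}
\usepackage{imakeidx}
\makeindex
\begin{document}
% The text before @ is the sort key, the text after @ is what the index prints.
संयुक्त\index{sanyukt@संयुक्त}
व्यंजन\index{vyanjan@व्यंजन}

\printindex
\end{document}
```

With this approach the sorting no longer depends on how the index processor orders Devanāgarī code points, at the cost of typing every key by hand.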

'.map' file, to be compiled into a '.tec' file with teckit_compile.exe:

; TECkit mapping for TeX input conventions <-> Unicode characters

LHSName "devanagari-to-latin"
RHSName "UNICODE"

pass(Unicode)

; ligatures from Knuth's original CMR fonts
U+002D U+002D           <>  U+2013  ; -- -> en dash
U+002D U+002D U+002D    <>  U+2014  ; --- -> em dash

U+0027          <>  U+2019  ; ' -> right single quote
U+0027 U+0027   <>  U+201D  ; '' -> right double quote
U+0022           >  U+201D  ; " -> right double quote

U+0060          <>  U+2018  ; ` -> left single quote
U+0060 U+0060   <>  U+201C  ; `` -> left double quote

U+0021 U+0060   <>  U+00A1  ; !` -> inverted exclam
U+003F U+0060   <>  U+00BF  ; ?` -> inverted question

; additions supported in T1 encoding
U+002C U+002C   <>  U+201E  ; ,, -> DOUBLE LOW-9 QUOTATION MARK
U+003C U+003C   <>  U+00AB  ; << -> LEFT POINTING GUILLEMET
U+003E U+003E   <>  U+00BB  ; >> -> RIGHT POINTING GUILLEMET



U+0924 <> U+0074 U+0061 ;  ta 
U+094D <> U+033A ; strikeout previous
U+0915 <> U+006B U+0061 ; ka
U+0941 <> U+033A U+0075 ; -u
U+092F <> U+0079 U+0061 ; ya
U+0902 <> U+006E U+0323 ; n.
U+0938 <> U+0073 U+0061 ; sa
U+0928 <> U+006E U+0061 ; na
U+091C <> U+006A U+0061 ; ja
U+0935 <> U+0076 U+0061 ; va

I would ordinarily expect the reader to want a 'normal' index, as well. Something like:

normal index

====

Original post

Looks OK for normal words (in xelatex and in the browser), if I have not misunderstood the question.

conjuncts

Since lualatex does not do conjunct consonants in the first place, there is no need to 'de-ligature' them to create the index entries.
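
The statement above reflects the LuaTeX font loader of the time. As a hedged aside: in more recent distributions lualatex runs on LuaHBTeX, and fontspec accepts a Renderer=HarfBuzz option there, which hands shaping to HarfBuzz. The sketch below assumes TeX Live 2020 or later (or an equivalent MiKTeX) and an installed Noto Serif Devanagari; whether the result matches the xelatex output for a given font has not been checked here.

```latex
% Sketch only: compile with lualatex on a distribution that provides LuaHBTeX.
\documentclass[12pt]{article}
\usepackage{fontspec}
% Renderer=HarfBuzz asks fontspec to shape the text with HarfBuzz.
\newfontfamily\devafont[Script=Devanagari,Renderer=HarfBuzz]{Noto Serif Devanagari}
\begin{document}
{\devafont संयुक्त व्यंजन}
\end{document}
```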

For indexing by (automated) transliteration, again, xelatex is easier, using a font-map (or l3 regex replace).
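
The l3 regex route mentioned above could look roughly like the following sketch. It is not the font-map method shown earlier: \translit is a hypothetical command, it covers only the characters of the two test words, and it drops the inherent a before a virama or a vowel sign instead of marking it. Noto Serif is assumed to be installed. For the two test words it is intended to give saṃyukta vyaṃjana.

```latex
% Sketch only: compile with xelatex or lualatex; Noto Serif is assumed to be installed.
\documentclass{article}
\usepackage{fontspec}
\setmainfont{Noto Serif}
\ExplSyntaxOn
\tl_new:N \l_translit_tmp_tl
% Hypothetical command: transliterate a short Devanagari string with regex replacements.
\NewDocumentCommand \translit { m }
  {
    \tl_set:Nn \l_translit_tmp_tl {#1}
    % Consonants, each written with its inherent vowel a.
    \regex_replace_all:nnN { \x{0915} } { ka } \l_translit_tmp_tl
    \regex_replace_all:nnN { \x{091C} } { ja } \l_translit_tmp_tl
    \regex_replace_all:nnN { \x{0924} } { ta } \l_translit_tmp_tl
    \regex_replace_all:nnN { \x{0928} } { na } \l_translit_tmp_tl
    \regex_replace_all:nnN { \x{092F} } { ya } \l_translit_tmp_tl
    \regex_replace_all:nnN { \x{0935} } { va } \l_translit_tmp_tl
    \regex_replace_all:nnN { \x{0938} } { sa } \l_translit_tmp_tl
    % Anusvara becomes m with dot below (U+1E43).
    \regex_replace_all:nnN { \x{0902} } { \x{1E43} } \l_translit_tmp_tl
    % The vowel sign u replaces the inherent a; the virama removes it.
    \regex_replace_all:nnN { a\x{0941} } { u } \l_translit_tmp_tl
    \regex_replace_all:nnN { a\x{094D} } { } \l_translit_tmp_tl
    \tl_use:N \l_translit_tmp_tl
  }
\ExplSyntaxOff
\begin{document}
\translit{संयुक्त} \translit{व्यंजन}
\end{document}
```

The order of the replacements matters: the consonants are expanded first so that the vowel sign and virama rules can then find the a they need to replace or remove.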


\documentclass[12pt]{article}
\usepackage{fontspec}
\setmainfont[Script=Devanagari]{Noto Serif Devanagari}
\newfontfamily\englishfont{Noto Serif}
\usepackage{polyglossia}
\setdefaultlanguage{hindi}
\setotherlanguages{english}


\begin{document}
\Large
संयुक्त व्यंजन

{\normalsize\textenglish{sanyukt vyanjan}}

\noindent   शुक्ल ख्मेर मुख्य अंग्रेज़ी \\ 
    अच्छा छुट्टी ठ्रेइन बुद्ध विद्यार्थी


\noindent {\normalsize\textenglish{shukla khmer mukhya angrezî \\ achchhâ chhuTTî trein buddha vidyârthî}}


ल् +    म = ल्म     फ़िल्म  \textenglish{film}
\end{document}

Test words from https://en.wikibooks.org/wiki/Hindi/Consonant_combinations
