[TeX/LaTeX] LuaLaTeX: How to use a \char directive inside a string.gsub function

luatex strings

Consider the following MWE:

% !TEX TS-program = lualatex
\documentclass{article}

\usepackage{fontspec} 
\setmainfont[Ligatures=NoCommon]{Latin Modern Roman}

\usepackage{luacode,luatexbase} 
\begin{luacode}
function dosub ( s )
   s =  string.gsub ( s , 'ff', '\\char64256{}') 
   return ( s )
end
--luatexbase.add_to_callback ( "process_input_buffer", dosub, "dosub" )
\end{luacode}

\begin{document}
off \directlua{ tex.sprint ( dosub ( \luastring{off} ) ) } off
\end{document}

The heart of the code is the function dosub, which employs the Lua function string.gsub. It is set to replace instances of ff with the glyph that contains the ff ligature. (You'll have to trust me that, for the font at hand, the ff-ligature glyph is located in "slot" 64256.) Note that, for now, the luatexbase.add_to_callback instruction is commented out. (A -- (double dash) string initiates a Lua comment.)

When this MWE is run, one gets:

[Image: typeset output of the MWE]

Observe that the middle word, which is generated via a \directlua call to dosub, correctly contains the ff-ligature, whereas the first and third words do not (once again correctly, since automatic ligature generation is disabled).

The trouble starts when I uncomment the instruction

luatexbase.add_to_callback ( "process_input_buffer", dosub, "dosub" )

Upon recompiling, the following, fairly incomprehensible, error message results:

(/usr/local/texlive/2015/texmf-dist/tex/context/base/supp-pdf.mkii
[Loading MPS to PDF converter (version 2006.09.02).]
\scratchcounter=\count290
\scratchdimen=\dimen261
\scratchbox=\box256
! Missing number, treated as zero.
\let
l.275 \let
\pdflastform=\pdflastxform
?

I suspect this is somewhat related to the presence of a TeX macro, \char, in the replacement string of the string.gsub function. To wit, if I replace '\\char64256{}' with 'gg' (i.e., a constant string), no error message is generated (and the three instances of "ff" in the body of the document are automatically replaced with "gg").

Do I need to "wrap" or "protect" the TeX macro in some special way in order to enable the successful use of "luatexbase.add_to_callback"? Is there something else I should do? About my computing setup: I'm running MacTeX2015 (with all available updates thru this morning applied) on a MacBookPro running MacOSX 10.10.5 "Yosemite".

Best Answer

There is a function, unicode.utf8.char, for inserting a Unicode character directly from Lua code:

% !TEX TS-program = lualatex
\documentclass{article}

\usepackage{fontspec} 
\setmainfont[Ligatures=NoCommon]{Latin Modern Roman}

\usepackage{luacode,luatexbase} 
\begin{luacode}
local uchar = unicode.utf8.char
function dosub ( s )
   s =  string.gsub ( s , 'ff', uchar(64256)) 
   return ( s )
end
\end{luacode}

\AtBeginDocument{%
      \luaexec{luatexbase.add_to_callback ( "process_input_buffer", dosub, "dosub" )}%
    }

\begin{document}
off \directlua{ tex.sprint ( dosub ( \luastring{off} ) ) } off
\end{document}

But the main issue in your code is that the callback is registered too early: it probably also replaces ff in some macro code that is loaded at \AtBeginDocument (the error in your log appears while supp-pdf.mkii is being read). So another solution is to register the callback in \AtBeginDocument as well, which reduces the risk of such a collision (you should do that even with the first method):

% !TEX TS-program = lualatex
\documentclass{article}

\usepackage{fontspec} 
\setmainfont[Ligatures=NoCommon]{Latin Modern Roman}

\usepackage{luacode,luatexbase} 
\begin{luacode}
function dosub ( s )
   s =  string.gsub ( s , 'ff', '\\char64256{}') 
   return ( s )
end
\end{luacode}
\AtBeginDocument{%
  \luaexec{luatexbase.add_to_callback ( "process_input_buffer", dosub, "dosub" )}%
}


\begin{document}
off \directlua{ tex.sprint ( dosub ( \luastring{off} ) ) } off
\end{document}
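
To see why an early callback is risky, the substitution can be tried on plain strings outside TeX. The sketch below is ordinary Lua using the same dosub as above; the input line with \offset is a made-up example, not a line taken from supp-pdf.mkii.

```lua
-- Same substitution as in the answer, run on plain strings (no TeX needed).
function dosub(s)
  s = string.gsub(s, 'ff', '\\char64256{}')
  return s
end

-- Ordinary text: the replacement is harmless.
print(dosub('off'))                       -- o\char64256{}

-- Hypothetical input line with a macro whose name contains "ff":
-- the control sequence \offset is split into \o followed by \char64256{}set.
print(dosub('\\setlength\\offset{1pt}'))  -- \setlength\o\char64256{}set{1pt}
```

Any input line that passes through the callback gets this treatment, which is why registering it only once the document body starts is the safer choice.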

Edit:

There is another catch: what if your document body includes a macro with ff in its name? To handle that, we can use a function like this:

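\begin{luacode}
local uchar = unicode.utf8.char

function dosub ( s )
  -- capture an optional backslash and the run of letters (or @) after it
  local x = s:gsub('(\\?)([%a%@]+)', function(back,text)
     if back~="" then 
        -- preceded by a backslash: a macro name, leave it alone
        return back .. text  
     end
     -- ordinary word: parentheses drop gsub's second return value (the count)
     return ( text:gsub ( 'ff', uchar(64256)) )
  end)
  return x
end
luatexbase.add_to_callback ( "process_input_buffer", dosub, "dosub" )
\end{luacode}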

With s:gsub('(\\?)([%a%@]+)', function(back,text) ... end) we catch all words, including macro names. If the variable back is not an empty string, the current word is a macro name and we return it unprocessed. Otherwise, we apply the ff replacement to it.

Note that in this case add_to_callback is used without \AtBeginDocument, because otherwise the text of a macro defined in the preamble (such as \offer) would not be replaced. Since we now skip macro names, registering the callback early should no longer cause problems.
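
The word-matching pattern can be checked outside TeX as well. This is a minimal sketch in plain Lua; it writes out the three UTF-8 bytes of code point 64256 by hand (the variable ff_lig is introduced here for illustration), so it does not assume that unicode.utf8.char is available in the interpreter.

```lua
-- UTF-8 bytes of code point 64256 (U+FB00), written out by hand so the
-- sketch runs in a plain Lua interpreter without unicode.utf8.char.
ff_lig = '\239\172\128'

function dosub(s)
  -- Capture an optional backslash and the run of letters (or @) after it.
  local x = s:gsub('(\\?)([%a%@]+)', function(back, text)
    if back ~= '' then
      return back .. text             -- macro name: returned unchanged
    end
    return (text:gsub('ff', ff_lig))  -- ordinary word: substitute
  end)
  return x
end

print(dosub('off') == 'o' .. ff_lig)                   -- true
print(dosub('\\offer') == '\\offer')                   -- true
print(dosub('\\offer off') == '\\offer o' .. ff_lig)   -- true
```

The pattern only protects a name that directly follows a backslash; other places where ff is not ordinary text (file names in \input, for example) would still be rewritten.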

As a closing remark, I would add that node-processing callbacks are much better suited to this kind of hack, precisely because of these problems with macros.

For instance, the following code does the replacement at the node level:

local fchar = string.byte("f")
local glyph_id = node.id("glyph")
local glue_id = node.id("glue")


-- Walk forward from n, collecting nodes, until a glyph or glue is met.
-- Returns true plus the collected nodes if that glyph is an "f".
local function next_status(n, node_table)
  node_table = node_table or {}
  if not n then return false end
  table.insert(node_table, n)
  if n.id == glyph_id and n.char == fchar then
    return true, node_table
  elseif n.id == glyph_id or n.id == glue_id then 
    return false
  else
    -- some other node type (kern, discretionary, ...): keep looking
    return next_status(n.next, node_table)
  end
end


local function node_dosub(nodes)
  for n in node.traverse(nodes) do
    if n.id == glyph_id and n.char == fchar then
      local found, node_table = next_status(n.next)
      if found then
        -- turn the first f into the ff ligature glyph
        n.char =  64256
        -- remove the second f and any nodes between the two
        for _, x in ipairs(node_table) do
          node.remove(nodes, x)
        end
      end
    end
  end
  return nodes
end

luatexbase.add_to_callback ( "pre_linebreak_filter", node_dosub, "node_dosub" )

This is more complicated, because we operate not on strings but on individual nodes. Many node types exist; glyph nodes are the ones that matter here (their id was 37 in the LuaTeX version used for this answer, but the code looks the number up with node.id("glyph"), since it can differ between versions). Every glyph node has a char field holding the character code. When a glyph with the f character is found, we peek at the following nodes to see whether another f glyph comes next. If so, we replace the current character with the code for the ff ligature and delete the second f glyph, together with any nodes between the two.
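
The look-ahead logic can be illustrated without LuaTeX by modelling the node list as an array of plain tables. This is only a sketch: the tables with id and char fields, and the function name merge_ff, are stand-ins invented for illustration, whereas real LuaTeX nodes are userdata objects linked through their next fields.

```lua
-- Plain-table stand-in for a node list (hypothetical; real nodes are userdata).
-- Each item is { id = 'glyph', char = <code> } or e.g. { id = 'kern' }.
local F, FF = string.byte('f'), 64256

function merge_ff(list)
  local out, i = {}, 1
  while i <= #list do
    local n = list[i]
    local merged = false
    if n.id == 'glyph' and n.char == F then
      -- Look ahead past items that are neither glyph nor glue.
      local j = i + 1
      while list[j] and list[j].id ~= 'glyph' and list[j].id ~= 'glue' do
        j = j + 1
      end
      if list[j] and list[j].id == 'glyph' and list[j].char == F then
        -- Second f found: emit one ligature glyph and drop everything
        -- up to and including that second f.
        out[#out + 1] = { id = 'glyph', char = FF }
        i = j + 1
        merged = true
      end
    end
    if not merged then
      out[#out + 1] = n
      i = i + 1
    end
  end
  return out
end

-- "o", "f", kern, "f" collapses to "o" plus the ligature glyph.
local r = merge_ff({
  { id = 'glyph', char = string.byte('o') },
  { id = 'glyph', char = F },
  { id = 'kern' },
  { id = 'glyph', char = F },
})
print(#r, r[2].char)  -- 2  64256
```

As in the callback above, glue between two f glyphs stops the search, so an f at the end of one word is never merged with an f at the start of the next.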
