What algorithm does LuaLaTeX use for fallback fonts?

Tags: luaotfload, luatex

Background
I have been busy with a book about the world's scripts and languages, on and off, for a couple of years. It covers all the scripts listed in Unicode, as well as a dozen or so that have no Unicode standard yet, for which I use images instead of fonts.

With the Noto fonts, Unicode, LuaLaTeX, and expl3 maturing, I have been able to print a reasonable range of all scripts, as needed in the write-up, with the exception of the East Asian scripts, of which I have only a few pages per script. I use Brill as the main font and added fallback fonts to cover the rest of the scripts. The book so far hovers around 350 pages, and I anticipate that it will run to a final size of 600 pages. To cover the Unicode codepoints, the fonts need to provide roughly 150,000 glyphs. Not all codepoints are used in the book; as I mentioned earlier, I estimate that I only need about half of that. Obviously and understandably, compilation speed is an issue, so I am trying to understand the algorithm used by luaotfload-fallback.lua to see if I can improve the processing time. I am looking at strategies to optimize compilation times, not only for my document but in general.
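
For concreteness, the fallbacks are set up through luaotfload's standard fallback mechanism, roughly as below; the name bookfallback, the font list, and the script tags are illustrative, and the real list covers many more scripts:

% In the preamble (requires fontspec); the fallback list is abbreviated
\directlua{
  luaotfload.add_fallback("bookfallback", {
    "NotoSerifHebrew:mode=node;script=hebr;",
    "NotoSerifDevanagari:mode=node;script=dev2;"
  })
}
\setmainfont{Brill}[RawFeature={fallback=bookfallback}]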

I have identified bottlenecks in three main areas: (a) fonts, (b) images, and (c) logging (disk writes in general). For the images I will use a preprocessor to optimize all of them and produce PDFs; I will write it in Go, and it can also do marking if needed. Ideas for fonts and logging are below.

  1. I have this (crazy) idea that the glyph info required at the nodes during processing be obtained from a local server, so that some tasks can be externalized and run concurrently. I am thinking of some form of priority queue, so that data for frequently used codepoints can be served fast, and any codepoints unused on a second run can be evicted from the cache. Again, I will use Go and SQLite3, since everything is local. At the moment I have a Lua table that maps Unicode codepoints to fonts, based on a config file (a minimal sketch of such a table appears after this list).

  2. All logging is also to be sent to a server rather than written to disk. The same can be done for the aux files.

  3. The generation of the PDF also takes time, but I am undecided at this point whether it can be optimized.
    The current compilation speed is about 1.3 seconds per page, plus an initial 30-40 seconds.
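
As a sketch of the mapping table mentioned in item 1 (the config file name and line format here are hypothetical):

-- Minimal sketch: build a codepoint-to-font table from a config file.
-- Each config line is assumed to look like "0590..05FF Noto Serif Hebrew".
local codepoint_to_font = {}

for line in io.lines("fallback-map.cfg") do
    local first, last, fontname = line:match("^(%x+)%.%.(%x+)%s+(.+)$")
    if first then
        for cp = tonumber(first, 16), tonumber(last, 16) do
            codepoint_to_font[cp] = fontname
        end
    end
end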

Question
Can someone explain the algorithmic steps in luaotfload-fallback.lua? When and how is it used by LuaTeX when building a document? At which point is the glyph info needed? Any ideas are welcome. Thank you for reading this far.

Best Answer

This doesn't answer the question in the title at all, but I think that it addresses the issues presented in the question body (hopefully).

Indirect Answer

Here's a solution that loads 231 unique fonts and prints 83 020 unique characters (103 pages) in 7.505 seconds (on average) using LuaLaTeX.

First, run this script to download all the fonts:

#!/bin/sh
set -eu

mkdir fonts
cd fonts

git clone --depth 1 --no-checkout --filter=blob:none \
    https://github.com/notofonts/notofonts.github.io.git
cd notofonts.github.io
git sparse-checkout set --no-cone '!/*' '/fonts/**/hinted/ttf/*-Regular.ttf'
git checkout main
cd ..

git clone --depth 1 --no-checkout --filter=blob:none \
    https://github.com/notofonts/noto-cjk.git
cd noto-cjk
git sparse-checkout set --no-cone '!/*' '/Serif/SubsetOTF/**/*-Regular.otf'
git checkout main
cd ..

wget -O unifont-Regular.otf \
    https://unifoundry.com/pub/unifont/unifont-15.1.04/font-builds/unifont-15.1.04.otf
wget -O unifont_upper-Regular.otf \
    https://unifoundry.com/pub/unifont/unifont-15.1.04/font-builds/unifont_upper-15.1.04.otf

wget -O NotoEmoji-Regular.ttf \
    "$(curl 'https://fonts.googleapis.com/css2?family=Noto+Emoji' | grep -o 'https.*ttf')"

cd ..

Then, place the following in all-characters.lua:

-- Save some globals for speed
local ipairs = ipairs
local max = math.max
local new_node = node.new
local node_write = node.write
local pairs = pairs

-- Define some constants
local GLUE_ID = node.id("glue")
local GLYPH_ID = node.id("glyph")
local SIZE = tex.sp("10pt")

-- Get all the fonts
local fontpaths = dir.glob("**-Regular.*", "./fonts")

-- Sort the fonts such that the "preferred" fonts are last
table.sort(fontpaths, function(a, b)
    local a = file.nameonly(a):match("(.+)-Regular")
    local b = file.nameonly(b):match("(.+)-Regular")

    if a:match("Serif") and not b:match("Serif") then
        return false
    end
    if b:match("Serif") and not a:match("Serif") then
        return true
    end
    if a:match("unifont") and not b:match("unifont") then
        return true
    end
    if b:match("unifont") and not a:match("unifont") then
        return false
    end
    if #a == #b then
        return a > b
    end
    return #a > #b
end)


-- Create a mapping from codepoint to font id
local by_character = {}
local virtual_fonts = {}

for _, filename in ipairs(fontpaths) do
    local fontdata = fonts.definers.read {
        lookup = "file",
        name = filename,
        size = SIZE,
        features = {},
    }
    local id = font.define(fontdata)
    fonts.definers.register(fontdata, id)

    virtual_fonts[#virtual_fonts + 1] = { id = id }

    for codepoint, char in pairs(fontdata.characters) do
        -- Skip variant glyphs whose slot differs from their Unicode value
        if char.unicode == codepoint then
            by_character[codepoint] = {
                width = char.width,
                height = char.height,
                depth = char.depth,
                font = id,
                commands = {
                    { "slot", #virtual_fonts, codepoint }
                },
            }
        end
    end
end

local function print_all_chars()
    local count = 0

    tex.forcehmode()
    for codepoint, data in table.sortedpairs(by_character) do
        local glyph = new_node(GLYPH_ID)
        glyph.font = data.font
        glyph.char = codepoint

        local space = new_node(GLUE_ID)
        space.width = max(2 * SIZE - glyph.width, 0)
        glyph.next = space

        node_write(glyph)
        count = count + 1
    end
    tex.sprint("\\par Characters: " .. count)
    tex.sprint("\\par Fonts: " .. #virtual_fonts)
end


-- Make the virtual font
local id = font.define {
    name = "all-characters",
    parameters = {},
    characters = by_character,
    properties = {},
    type = "virtual",
    fonts = virtual_fonts,
}

local new_command
if ltx then
    new_command = function(name, func)
        local index = luatexbase.new_luafunction(name)
        lua.get_functions_table()[index] = func
        token.set_lua(name, index, "protected")
    end
elseif context then
    new_command = function(name, func)
        interfaces.implement {
            name = name,
            actions = func,
            public = true,
        }
    end
end

new_command("printallchars", print_all_chars)
new_command("allcharactersfont", function() font.current(id) end)

Then, you can print all the characters using the following document:

\documentclass{article}

\ExplSyntaxOn
\lua_load_module:n { all-characters }
\ExplSyntaxOff

\begin{document}
    \printallchars
\end{document}

ConTeXt is faster still, at 4.849 seconds on average (LuaLaTeX takes about 55% longer):

\ctxloadluafile{all-characters}

\starttext
    \printallchars
\stoptext

More usefully, this also defines a command \allcharactersfont that switches to a virtual font containing characters from all the loaded fonts:

\documentclass{article}
\pagestyle{empty}

\ExplSyntaxOn
\lua_load_module:n { all-characters }
\ExplSyntaxOff

\begin{document}
    {\allcharactersfont
        A Ξ Ж س
        क ௵ ෴ ფ
        ጄ ᑠ ᘟ Ⅶ
        ∰ ⡿ だ 㬯
        ䷥ 𐎠 𒅪 𓈥
        𘎡 𝄟 𝔸 🦆
    }
\end{document}

[output: the sample characters above, each rendered by the virtual font]

Direct Answer

  1. I have this (crazy) idea that the glyph info required at the nodes during processing be obtained from a local server, so that some tasks can be externalized and run concurrently. I am thinking of some form of priority queue, so that data for frequently used codepoints can be served fast, and any codepoints unused on a second run can be evicted from the cache. Again, I will use Go and SQLite3, since everything is local. At the moment I have a Lua table that maps Unicode codepoints to fonts, based on a config file.

The document below loads all 231 fonts in 2.426 seconds on average, so there's not much room to speed up the font loading. (The \csname@@end\endcsname expands to the \end primitive, ending the run as soon as the module has loaded, so this times little more than the font loading itself.)

\ExplSyntaxOn
\lua_load_module:n { all-characters }
\csname@@end\endcsname

If you did still want to speed it up, the easiest way would be to place the font files and luaotfload caches in a RAM disk.
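
For example, something like this on Linux (a rough sketch; the mount point, size, and cache paths depend on your system and TeX Live version):

# Move the per-user TeX cache (which holds the luaotfload cache) to a tmpfs
texmfvar="$(kpsewhich -var-value TEXMFVAR)"
mkdir -p /tmp/ramdisk
sudo mount -t tmpfs -o size=512m tmpfs /tmp/ramdisk
cp -r "$texmfvar" /tmp/ramdisk/texmf-var
export TEXMFVAR=/tmp/ramdisk/texmf-var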

  2. All logging is also to be sent to a server rather than written to disk. The same can be done for the aux files.

Aside from some package initialization spam and overfull box warnings, your document shouldn't be producing that much log output. If you do have that much output, then I'd try to reduce the amount of output rather than trying to optimize how it's written.

  3. The generation of the PDF also takes time, but I am undecided at this point whether it can be optimized. The current compilation speed is about 1.3 seconds per page, plus an initial 30-40 seconds.

Disabling PDF compression can help a little, but 1.3 seconds per page suggests that something else is going on.
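
If you want to try it, disabling compression in LuaTeX is just the following in the preamble (the PDF gets considerably larger in exchange for a small time saving):

% LuaTeX: write uncompressed streams and objects
\pdfvariable compresslevel 0
\pdfvariable objcompresslevel 0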

Another common issue is complicated TikZ figures, so if you're drawing any glyphs with TikZ then you should externalize and cache them.
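
The standard setup for that is the external library (run with -shell-escape so the figures can be compiled on the first run):

\usepackage{tikz}
\usetikzlibrary{external}
% Each figure is compiled once to tikz-cache/*.pdf and reused afterwards
\tikzexternalize[prefix=tikz-cache/]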

Loading images can also be slow, so if you're loading a bunch of characters as individual files, then it's quite a bit faster to combine them all into a single PDF file and select the character by page number. pdfTeX (and maybe LuaTeX too?) closes each opened PDF file after every page, so it's much faster to load all the pages/characters into individual boxes at the start of each run than it is to reload the PDF file each time. (Or better yet, see the suggestion below.)
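
As a minimal sketch, assuming the character images were merged into a hypothetical chars.pdf with one character per page:

\documentclass{article}
\usepackage{graphicx}
\newsavebox\glyphbox
\begin{document}
% The file is opened only once, when the page is loaded into the box ...
\sbox\glyphbox{\includegraphics[page=42]{chars.pdf}}
% ... after which the box can be reused as often as needed.
\usebox\glyphbox\ and again: \usebox\glyphbox
\end{document}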

as well as a dozen or so that have no Unicode standard yet, for which I use images instead of fonts.

[...]

For the images I will use a preprocessor to optimize all of them and produce PDFs

If you have the character images available as SVG files, then my (unreleased/experimental) unnamed-emoji package solves almost this exact problem. There's a little bit of end-user documentation, but for actually building the “font” files you'll need to use the Makefile as a rough guide.
