How does a TeX engine read and render the input stream?

tex-core

I am new to TeX programming and have not finished reading The TeXbook yet. I need an at-a-glance overview of how a TeX engine reads its input, processes it, and produces its output.

As a newbie, my mental model of any black box that receives an input stream and produces an output stream is as follows.

The black box

  1. takes a chunk of characters of constant length,
  2. processes that chunk,
  3. saves the processed chunk to a file,
  4. repeats steps 1 to 3 until the whole input stream has been processed.

Does a TeX engine use this chunk-by-chunk mechanism? Or does it make multiple passes over the input from start to finish, as follows?

A TeX engine

  1. scans the whole input stream (an input file) from start to finish,
  2. performs the first round of expansion,
  3. repeats steps 1 and 2 for the nth round of expansion until no more expansion is possible,
  4. saves the rendered output to a file, for example in PDF format.

An answer to this question would really help me learn plain TeX faster than reading The TeXbook.

Edit 1 (Sept 9, 2014)

Rather than spawning a new related question, I think it is better to just edit this question.

It is still hard for me to digest how a TeX engine works without an example, so here is a simple one.

The comments of the form % unit X are added deliberately to make referencing easier.

% unit A
\def\aa[#1]{a's value: #1}
\def\bb{[Hi!]}
\def\cc{\bb}
\def\dd{\cc}

% unit B
\expandafter\expandafter\expandafter\expandafter
\expandafter\expandafter
\expandafter
\aa\dd


% unit C
\expandafter\expandafter
\expandafter
\aa\cc

% unit D
\expandafter
\aa\bb

\bye 

Could you elaborate on your answer using this example? The points I want to know:

  1. How does TeX know when to proceed to the next unit?
  2. Is my understanding correct that scanning goes from top to bottom, and that TeX proceeds to the next unit only after the previous unit has been completely expanded and executed? I mean that while unit B is being expanded and executed, the remaining units (C and D) are untouched.

Best Answer

TeX processes its input in three stages: (1) converting the input stream to tokens, (2) expanding tokens, and (3) executing complete commands (made up of tokens).

In more detail, it first prepares one line of input by stripping off the end-of-line marker (which is OS dependent) and all spaces at the end of the line (usually tab characters as well). It then appends its own endline character (the value of \endlinechar, normally ctrl-M).
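The effect of the appended endline character can be seen in a small plain TeX file. This is only a sketch, assuming plain TeX with its default category codes; the sample words are made up for illustration.

```tex
% Default: every input line gets the endline character appended
% (\endlinechar=13), and it is read as a space, so this typesets "Hello world".
Hello
world

\begingroup
\endlinechar=-1 % from the next line on, nothing is appended to input lines
Hello
world
\endgroup

\bye
```

The second paragraph should come out as "Helloworld", because no endline character separates the two lines. The \endgroup restores the normal value of \endlinechar before the following blank line is read.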

It then reads the line one character at a time until it obtains a complete token (generally a single character or a control sequence). This token is passed to stage (2), the macro expansion engine, which expands it if it is expandable, asks stage (1) for more tokens if the expansion needs them, or otherwise passes it along to stage (3), the execution process. Stage (3) either executes the command or asks stage (2) for more tokens, which may in turn have to ask stage (1) for more.
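One way to watch stages (2) and (3) at work is to switch on TeX's tracing parameters. The sketch below reuses \aa and \bb from unit D of the question; the choice of tracing settings is mine, not part of the original example.

```tex
\def\aa[#1]{a's value: #1}
\def\bb{[Hi!]}

\tracingonline=1   % also show the trace on the terminal
\tracingmacros=1   % log each macro expansion (stage 2)
\tracingcommands=1 % log each executed command (stage 3)

\expandafter\aa\bb

\bye
```

The trace should show \bb being expanded before \aa, followed by the commands that typeset the resulting characters. Because the ] that delimits the argument of \aa is already present once \bb has been expanded, TeX does not need to read any further input lines to finish this unit.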

The tokenizing stage converts most characters to a token consisting of a character-code/category-code pair. Since the execution engine is capable of assigning new category codes to a character, it can influence the behavior of stage (1).
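A common illustration of this feedback from stage (3) to stage (1) is changing the category code of @. The macro name \my@name below is hypothetical and chosen only for this sketch.

```tex
\catcode`\@=11 % from here on, @ is tokenized as a letter
\def\my@name{a control word containing @}
\my@name

\catcode`\@=12 % @ is an "other" character again
% Now \my@name would be read as the control word \my followed by @, n, a, m, e.

\bye
```

A category code change affects only characters that have not yet been tokenized; tokens that already exist keep the category code they were given.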

One slight complication is an endline character that occurs within a line: TeX treats it as the end of the line and ignores the rest of the line. Likewise, a comment character (normally %) causes the rest of the line to be ignored, including the endline character.
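Both cases can be tried in plain TeX. This sketch assumes the default category codes, under which ^^M denotes the endline character (ctrl-M); the sample text is made up.

```tex
abc% the comment character discards the rest of this line, endline character included
def

abc^^Mthis text is ignored
def

\bye
```

The first paragraph should be typeset as "abcdef", since the comment also removes the endline character. The second should be typeset as "abc def", since ^^M ends the line early and is read as a space.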

Note that tokenization happens within a line (you cannot start a control sequence token on one line and finish it on the next), but expansion, stage (2), can request more tokens, and that request can carry over to the next line.
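A macro with a delimited argument shows expansion reaching across lines. The sketch below reuses \aa from the question; the two-line argument is an assumption made for illustration.

```tex
\def\aa[#1]{a's value: #1}

% The argument of \aa is delimited by ], which appears only on the second
% line, so the expansion of \aa pulls in tokens from both lines.
\aa[Hi
there!]

% A control sequence name, in contrast, cannot be split across lines:
% "\by" on one line followed by "e" on the next would be read as the
% control sequence \by followed by the letter e, not as \bye.

\bye
```

This should typeset "a's value: Hi there!", with the end of the first line read as a space inside the argument.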

I am not sure this process fits either of your two schemes. The "chunks" are either lines or tokens, neither of which has constant length, so the first description is out. And TeX certainly doesn't read a whole file at once (although the file may be moved from disk to an input buffer for efficiency).

As for outputting the results: TeX sends them to the output a page at a time. After each paragraph has been processed, it evaluates whether a full page has been obtained. When it breaks a paragraph into lines (including hyphenation and justification), it operates on the entire paragraph at once.
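The paragraph-level and page-level decisions can be observed with two more tracing parameters. This is a minimal sketch with made-up sample text; the exact log contents depend on the fonts and page dimensions in use.

```tex
\tracingonline=1
\tracingparagraphs=1 % log the line-breaking calculations for each paragraph
\tracingpages=1      % log the page builder's cost calculations

This is a short paragraph. \TeX\ collects all of it before choosing line
breaks, and only afterwards considers whether a page is full.

\bye
```

The line-breaking trace appears only when the paragraph ends, which matches the description above: TeX works on the whole paragraph before it decides where the lines break.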
