[TeX/LaTeX] How does LaTeX know the page number of a reference?

cross-referencing

I am creating documents with various references, citations, an index, table of contents, and other linked elements. How does LaTeX know how to properly cross reference a page while properly formatting a page and keeping the page numbers correct?

How does LaTeX know that inserting a page reference won't modify the page of the referenced label?

For example, if I reference a section later in my document with \ref{sec:bestsection}, then LaTeX must figure out later what page \label{sec:bestsection} is on and then go back and insert that at the \ref position on the second compile (see this answer).

However, suppose I have a large document and sec:bestsection happens to be on page 10230. Also, suppose that this long page number takes up enough space to push another word off the end of the page. One thing leads to another, and now sec:bestsection is on page 10231.

How does LaTeX know that this happened and not use the wrong page number? Does it always assume that a \ref will fit on the line?

Best Answer

The answer to your question in the title is “LaTeX doesn't know”, at least not completely.

How does LaTeX manage cross-references?

Suppose you have

As we shall see in section~\ref{sec!main-results}, ...

...[many pages later]...

\section{Main results}\label{sec!main-results}

We are now ready to prove the most important theorems.

When the \label is seen, the page with \ref has long been typeset and output: there's no way to “go back and fix the number”.

So the approach is not “get all the cross-references right at once”, because that would require keeping the whole document in memory before doing any typesetting. Instead, LaTeX writes a note as soon as it outputs a page on which a \label command appeared: in the .aux file you'll find something like

\newlabel{sec!main-results}{{3}{9}}

where the first number is the section number and the second one is the page number. The note is written out only when the page is shipped out, because only then is the page number really known. Remember that TeX always looks ahead: it typesets full paragraphs before deciding on a page break.
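To watch this happen, here is a minimal sketch of a test document. The file layout is my own choice, and \newpage merely stands in for the “many pages later” of a real document. Run LaTeX on it twice: the first run should print ?? for both references and write a \newlabel line for sec!main-results to the .aux file; the second run reads that line back and prints the numbers. The exact fields inside \newlabel depend on the LaTeX kernel version and on loaded packages, so don't expect them to match the line above literally.

```latex
\documentclass{article}
\begin{document}

As we shall see in section~\ref{sec!main-results}
on page~\pageref{sec!main-results}, \dots

\newpage % stands in for the many pages of a real document

\section{Main results}\label{sec!main-results}

We are now ready to prove the most important theorems.

\end{document}
```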

At the end of the job, the .aux file is closed and then input. At that point \newlabel gets a suitable definition whose purpose is to check whether the label was already known from a previous run and, if so, whether one of the associated numbers has changed.

This is the point where you can see warnings such as

Label(s) may have changed. Rerun to get cross-references right

There were multiply-defined labels

Label `<label>' multiply defined

that should be self-explanatory.
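For completeness, a small sketch that should provoke the two “multiply defined” warnings. The label name sec!dup is made up for this example. As far as I can tell, the warnings appear once the .aux file from an earlier run contains both entries, so compile at least twice.

```latex
\documentclass{article}
\begin{document}

\section{First}\label{sec!dup}

\section{Second}\label{sec!dup} % same label name reused by mistake

See section~\ref{sec!dup}. % ambiguous: two sections claim this label

\end{document}
```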

At the start of a job, when LaTeX is processing \begin{document}, the .aux file is input and \newlabel gets a different definition, which allows \ref{sec!main-results} or \pageref{sec!main-results} to print the proper number. But what number? In the case above, the section number will be 3, even if you have added a whole section between runs. Only at the end of the job will LaTeX know that the number has changed, and it will then issue the first of the warnings listed above.

If a cross-reference is unknown, just ?? will be printed and the warning about changed labels will be issued. If a \ref or \pageref command refers to a label that is not yet defined, you get the warning
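The lookup side can be imitated with a few lines of code. This is only a simplified sketch of the idea, not the kernel's actual definitions: \mynewlabel, \myref, \mypageref and \my@pick are hypothetical names invented here, and the real macros also handle warnings and the extensions made by packages.

```latex
\documentclass{article}
\makeatletter
% Hypothetical, simplified stand-ins for the kernel's machinery.
% \mynewlabel{<name>}{{<number>}{<page>}} stores the pair in \myr@<name>.
\newcommand{\mynewlabel}[2]{%
  \expandafter\gdef\csname myr@#1\endcsname{#2}}
% \my@pick{<name>}{<selector>}: print ?? if nothing is stored for <name>,
% otherwise let <selector> pick one of the two stored items.
\newcommand{\my@pick}[2]{%
  \expandafter\ifx\csname myr@#1\endcsname\relax
    \textbf{??}%
  \else
    \expandafter\expandafter\expandafter#2\csname myr@#1\endcsname
  \fi}
\newcommand{\myref}[1]{\my@pick{#1}{\@firstoftwo}}
\newcommand{\mypageref}[1]{\my@pick{#1}{\@secondoftwo}}
\makeatother

\begin{document}

% This line plays the role of an entry read from the .aux file.
\mynewlabel{sec!main-results}{{3}{9}}

Section~\myref{sec!main-results} is on page~\mypageref{sec!main-results};
an unknown label gives \myref{no-such-label}.

\end{document}
```

The point of the sketch is that the printed numbers come entirely from what was stored, here the hard-coded {3}{9}, and not from where the section actually ends up in the current run.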

There were undefined references

but also

Reference `<label>' on page <page> undefined

which may indicate that you have misspelled the label.
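A misspelled label is easy to reproduce. In the sketch below (label names borrowed from the question, the misspelling is deliberate) the first reference should stay at ?? however many times you rerun, while the second resolves after the second run.

```latex
\documentclass{article}
\begin{document}

\section{Best section}\label{sec:bestsection}

% Deliberately misspelled: "bestsction" has no matching \label.
See section~\ref{sec:bestsction}.

% Correctly spelled: resolved from the .aux file on the second run.
See section~\ref{sec:bestsection}.

\end{document}
```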

Should we care about the size of a cross-reference?

Should we care about the space used by the reference? Not really. Paragraphs usually have enough flexibility to allow a line to shrink or stretch without substantially modifying the output. This is not completely foolproof, and there are examples of cleverly written documents that never stabilize: each new run of LaTeX changes the page number associated with a label, so it never settles. However, the chances of this happening in a real document are pretty small.

What about multiple runs?

Build tools such as latexmk can look in the .log file for warnings about changed labels or undefined references and trigger a new run to fix the output. However, it is not so important that the cross-references are correct at every point in time: they will be once a run produces none of the warnings above.

What about the .aux file?

The .aux file is used for several other purposes: citations, for example, but also other administrative tasks. Packages, notably hyperref, can modify the annotations by extending the syntax for the two versions of \newlabel, but covering that here would take too long. The idea is still the same.

Important note. It's clear from this description that preserving the integrity of the .aux file between runs is essential. This file should generally not be removed unless it has become corrupt because of some fatal error. An incomplete annotation might cause an error when the file is input: interrupting the LaTeX run at this error preserves the same corrupt file, so the same error reappears at the next run. In such cases, removing the .aux file is the only remedy. That's not a big deal: it costs one more run of LaTeX (maybe two). But, of course, removing a correct .aux file at the end of a run will always produce warnings about undefined references on the next run.

Finally, there is a switch that makes LaTeX not touch any of the files it writes out: if you add \nofiles to the preamble, the .aux file and the files used for the table of contents and similar lists will only be input and not rewritten. It's a relic of the past, when writing to a file, or even just keeping some files open, caused delays; once one was sure that cross-references and lists were correct, adding \nofiles saved some running time. Nowadays the overhead is so small that the trick is almost useless.
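For illustration, a minimal sketch of where the switch goes. The document body is made up for the example. If no .aux or .toc file exists from an earlier run without \nofiles, the references stay at ?? and the table of contents stays empty, because nothing is ever written.

```latex
\documentclass{article}
\nofiles % read existing auxiliary files, but do not rewrite them
\begin{document}

\tableofcontents

\section{Main results}\label{sec!main-results}

See page~\pageref{sec!main-results}.

\end{document}
```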
