[TeX/LaTeX] Add page and line numbers to a PDF

line-numbering page-numbering pdf

Is there a quick script to add page and line numbers to each page of a PDF document?

  1. I often receive articles to review as PDFs with no page numbers. I end up writing them in by hand so that I can refer to each page when pointing out errors.

  2. When referring to an error, I end up counting lines by hand from the top or bottom of the page, or copying the context, to pin down the location of the error. It would be much more practical to have a standard way to add line numbers to an existing document.

I could manage this by editing the LaTeX source, but not when all I receive is a PDF. The PDF format does not contain lines per se, so identifying them would require clustering the $y$-coordinates of the letters, and adding the numbers in the margin would require taking the minimum of the $x$-coordinates and subtracting a fixed amount from it. Has anybody already written such a script, or seen another way?

Best Answer

Alright, here's a go at numbering lines in a PDF (or any other image format) without access to the source.

I wrote a little shell script that uses ImageMagick (at least version 6.6.9-4). It converts a given PDF into a separate raster image for each page and splits each page into a left and a right half. Each half is shrunk to a width of one pixel (so it takes the horizontal average, basically) and turned into a monochrome image with a given threshold (black=text, white=no text). Every black sequence is then shrunk down to one pixel (=middle of a line). The result is output as text and piped to sed, which cleans it up and removes all the non-text lines. Finally, the script writes a txt file with the position of each line in units of 1/1000 of the page height.
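The "shrink every sequence down to one pixel" step can be illustrated without ImageMagick. The following is only a minimal sketch of the idea, not the script used in this answer: it assumes the 1-pixel-wide column has already been thresholded into one value per row (1 = text, 0 = blank), and the function name run_midpoints is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper (not part of findlines.sh): reads one value per row
# (1 = text, 0 = blank) and prints the zero-based middle row of every run
# of consecutive 1s, i.e. the approximate centre of each text line.
run_midpoints() {
  awk '
    $1 == 1 { if (!inrun) { start = NR - 1; inrun = 1 }; last = NR - 1; next }
    inrun   { print int((start + last) / 2); inrun = 0 }
    END     { if (inrun) print int((start + last) / 2) }
  '
}

# Toy 10-row column: text on rows 1-3 and on rows 6-7.
printf '%s\n' 0 1 1 1 0 0 1 1 0 0 | run_midpoints   # prints 2 and 6
```

In the real script the same reduction is done by the -morphology operators on a 1000-row column, so the printed row indices are directly the positions in 1/1000 of the page height.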

findlines.sh:

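#!/bin/sh
# Usage: ./findlines.sh basename   (expects basename.pdf in the current directory)
# Split every page into a left and a right half: basename-0, basename-1, ...
convert "$1.pdf" -crop 50x100% "png:$1"
for f in "$1"-*; do
  # Squash the half page to a 1x1000 column, threshold it, thin each run of text to one pixel, and keep only the row index of every remaining pixel
  convert "$f" -flatten -resize 1X1000! -black-threshold 99% -white-threshold 10% -negate -morphology Erode Diamond -morphology Thinning:-1 Skeleton -black-threshold 50% txt:- | sed -e '1d' -e '/#000000/d' -e 's/^[^,]*,//' -e 's/[(]//g' -e 's/:.*//' -e 's/,/ /g' > "$f.txt"
done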
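To see what the sed stage does on its own, here is a small sketch that feeds it a made-up pixel dump. The sample lines are an assumption about what ImageMagick's txt: output looks like (the exact layout can differ between versions), and the function name clean_pixel_dump is only for illustration; the sed expressions are the ones from findlines.sh.

```shell
#!/bin/sh
# Same sed expressions as in findlines.sh, wrapped in a function:
#   1d            drop the header line
#   /#000000/d    drop black (non-text) pixels
#   s/^[^,]*,//   drop the x coordinate
#   s/:.*//       drop everything after the row index
clean_pixel_dump() {
  sed -e '1d' -e '/#000000/d' -e 's/^[^,]*,//' -e 's/[(]//g' -e 's/:.*//' -e 's/,/ /g'
}

# Made-up sample input; only rows 17 and 42 are non-black.
clean_pixel_dump <<'EOF'
# ImageMagick pixel enumeration: 1,1000,255,srgb
0,0: (0,0,0)  #000000  black
0,17: (255,255,255)  #FFFFFF  white
0,18: (0,0,0)  #000000  black
0,42: (255,255,255)  #FFFFFF  white
EOF
```

With this input, only the row indices 17 and 42 survive, one per line, which is the format the txt files are expected to have.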

Running the script takes about 1 second per page and produces a number of files named basename-<number>.txt. The numbering starts at 0, so even <numbers> contain the line positions of the left column and odd <numbers> those of the right column. These files can then be read by pgfplotstable (at least v 1.4) and used to typeset the line numbers on top of the imported PDF file. I defined a command that takes the page number and four line numbers as arguments; the four line numbers tell the macro at which "raw" line numbers the "real" text lines start and end in the left and right columns. By setting \pgfkeys{print raw line numbers=true}, the raw line numbers as found by the algorithm are shown in red.
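The mapping from a page number to its two position files follows the \leftnumber and \rightnumber computations in the macro: (page-1)*2 for the left column and (page-1)*2+1 for the right one. As a quick sanity check, it can be sketched in shell like this; files_for_page is a hypothetical helper, not something the answer's script provides.

```shell
#!/bin/sh
# Hypothetical helper: print the two position files that belong to a page.
# Pages count from 1, half-page files count from 0.
files_for_page() {
  base=$1
  page=$2
  left=$(( (page - 1) * 2 ))   # left column: even index
  right=$(( left + 1 ))        # right column: odd index
  printf '%s-%s.txt %s-%s.txt\n' "$base" "$left" "$base" "$right"
}

files_for_page article 1   # article-0.txt article-1.txt
files_for_page article 3   # article-4.txt article-5.txt
```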

\documentclass{article}
\usepackage{tikz}
\usepackage{pgfplotstable}

\newif\ifprintrawlinenumbers
\pgfkeys{print raw line numbers/.is if=printrawlinenumbers,
  print raw line numbers=true}
\newcommand{\addlinenumbers}[5]{
  \pgfmathtruncatemacro{\leftnumber}{(#1-1)*2}
  \pgfmathtruncatemacro{\rightnumber}{(#1-1)*2+1}
  \pgfplotstableread{\pdfname-\leftnumber.txt}\leftlines
  \pgfplotstableread{\pdfname-\rightnumber.txt}\rightlines
  \begin{tikzpicture}[font=\tiny,anchor=east]
  \node[anchor=south west,inner sep=0] (image) at (0,0) {\includegraphics[width=14cm,page=#1]{\pdfname.pdf}};
    \begin{scope}[x={(image.south east)},y={(image.north west)}]
      \pgfplotstableforeachcolumnelement{[index] 0}\of\leftlines\as\position{
        \ifprintrawlinenumbers
          \node [font=\tiny,red] at (0.04,1-\position/1000)         {\pgfplotstablerow};
        \fi
        \pgfmathtruncatemacro{\checkexcluded}{
          (\pgfplotstablerow>=#2 && \pgfplotstablerow<=#3) ? 1 : 0
        }
        \ifnum\checkexcluded=1
          \pgfmathtruncatemacro\linenumber{\pgfplotstablerow-#2+1}
          \node [font=\tiny,align=right,anchor=east] at (0.08,1-\position/1000) {\linenumber};
        \fi
      }
      \pgfplotstablegetrowsof{\leftlines}
      \pgfmathtruncatemacro\rightstart{min((\pgfplotsretval-#2),(#3-#2+1))}
      \pgfplotstableforeachcolumnelement{[index] 0}\of\rightlines\as\position{
        \ifprintrawlinenumbers
          \node [font=\tiny,red,anchor=east] at (1.0,1-\position/1000) {\pgfplotstablerow};
        \fi
        \pgfmathtruncatemacro{\checkexcluded}{
          (\pgfplotstablerow>=#4 && \pgfplotstablerow<=#5) ? 1 : 0
        }
        \ifnum\checkexcluded=1
          \pgfmathtruncatemacro\linenumber{\pgfplotstablerow-#4+\rightstart+1}
          \node [font=\tiny] at (0.96,1-\position/1000) {\linenumber};
        \fi
      }
    \end{scope}
  \end{tikzpicture}
}

\begin{document}

\def\pdfname{article}
\addlinenumbers{1}{20}{50}{2}{65}
\pgfkeys{print raw line numbers=false}
\addlinenumbers{2}{0}{69}{0}{64}
\addlinenumbers{3}{19}{47}{21}{48}

\end{document}

As a proof of concept, here's the output for the first two pages of an article from the Environmental Science & Technology Journal. I think it works really well. I haven't been able to call findlines.sh from within LaTeX, though, so this step has to be performed manually before compiling the .tex file.

first page of a pdf with line numbers

second page of a pdf with line numbers
