[TeX/LaTeX] Get the automatically produced line and page breaks from LaTeX

line-breaking page-breaking

My aim is to produce an HTML file with the same text as the PDF produced by LaTeX. The HTML should reproduce the pagination and line-break structure of the PDF: where the PDF has a line break I want to emit a <br>, where it starts a new paragraph a <p>, and where it starts a new page a horizontal rule.

Handling the paragraphs is easy, since they are defined in the input file. But line breaking and pagination depend on the font and on the width and height of the document (and maybe on some other things I cannot even imagine yet).

Is there a way to get LaTeX to tell me where it broke the lines and where it started a new page?

Best Answer

This LaTeX document:

\documentclass{article}
\showoutput
\usepackage{lipsum}
\begin{document}

\lipsum

\end{document}

produces a log file showing the position of all the output:

.....
LaTeX Font Info:    Checking defaults for OMX/cmex/m/n on input line 4.
LaTeX Font Info:    ... okay on input line 4.
LaTeX Font Info:    Checking defaults for U/cmr/m/n on input line 4.
LaTeX Font Info:    ... okay on input line 4.

Completed box being shipped out [1]
\vbox(633.0+0.0)x407.0
.\glue 16.0
.\vbox(617.0+0.0)x345.0, shifted 62.0
..\vbox(12.0+0.0)x345.0, glue set 12.0fil
...\glue 0.0 plus 1.0fil
...\hbox(0.0+0.0)x345.0
..\glue 25.0
..\glue(\lineskip) 0.0
......
...\hbox(6.94444+1.94444)x345.0, glue set 0.85849
....\hbox(0.0+0.0)x15.0
....\OT1/cmr/m/n/10 L
....\OT1/cmr/m/n/10 o
....\OT1/cmr/m/n/10 r
....\OT1/cmr/m/n/10 e
....\OT1/cmr/m/n/10 m
....\glue 3.33333 plus 1.66666 minus 1.11111
....\OT1/cmr/m/n/10 i
....\OT1/cmr/m/n/10 p
.......
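
To read this log: the leading dots give the nesting depth of each box, \hbox(height+depth)xwidth reports a box's dimensions in points, and each \hbox of width 345.0 inside the page body is one typeset line. As far as I know, \showoutput is essentially a shorthand that switches on TeX's \tracingoutput and raises \showboxbreadth and \showboxdepth to their maximum; the exact kernel definition may differ between LaTeX releases. A minimal sketch that sets these primitives directly, on the assumption that you want the box dump in the .log only and not on the terminal:

```latex
\documentclass{article}
% Sketch only: assumed to be roughly what \showoutput enables,
% minus the terminal output.
\tracingonline=0            % keep the box dump out of the terminal
\tracingoutput=1            % log every box handed to \shipout
\showboxbreadth=\maxdimen   % do not truncate the items listed per level
\showboxdepth=\maxdimen     % do not truncate the nesting depth
\usepackage{lipsum}
\begin{document}

\lipsum

\end{document}
```

Either form should write one "Completed box being shipped out" entry per page, which is the marker the post-processing step relies on.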

So with a bit of Perl (which might need to be made smarter for a real document), you can reconstitute the text and add the requested line, paragraph, and page markup:

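#!/usr/bin/perl
use strict;
use warnings;

while (<>) {
    chomp;
    # Character node: dots, font name, one character (e.g. "....\OT1/cmr/m/n/10 L").
    print "$1" if m@^\.[^ ]* (.)\s*$@;
    # Ligature node: print its component letters (e.g. "(ligature fi)").
    print "$1" if m@ligature ([^ ]*)\)\s*$@;
    # Glue with a natural width above 2pt is treated as an interword space.
    print " " if m@^\.*\\glue ([0-9.]+)@ && $1 > 2;
    # \baselineskip glue separates lines; \parskip glue precedes a paragraph.
    print "\n<br>" if m@\\baselineskip@;
    print "\n<p>" if m@\\parskip@;
    # Each shipped-out box is one page.
    print "\n\n<hr>\n\n" if m@Completed box being shipped@;
}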

Then, with the script saved as zz.pl and the log as zz.log, running perl zz.pl zz.log > zz.html produces:

.....
<br>fau-cibus. Morbi do-lor nulla, male-suada eu, pul-v-inar at, mol-lis ac, nulla. Cur-
<br>abitur auc-tor sem-per nulla. Donec var-ius orci eget risus. Duis nibh mi, congue
<br>eu, ac-cum-san eleifend, sagit-tis quis, diam. Duis eget orci sit amet orci dig-nis-sim
<br>rutrum.
<p>
<br>Nam dui ligula, fringilla a, euismod sodales, sollicitudin vel, wisi. Morbi
<br>auctor lorem non justo. Nam lacus libero, pretium at, lobortis vitae, ultric
...
