[TeX/LaTeX] LaTeX to XML using pdflatex

pdftex xml

I am converting LaTeX to XML. While compiling with pdflatex, we generate an XML file using \immediate\write{text} in the tagged commands. But how can I get the normal text into the XML file? Can anyone explain with samples?

Best Answer

I don't know of a solution with LaTeX and the pdfTeX engine, but ConTeXt MkIV (which uses the LuaTeX engine) supports an XML backend that is used to generate EPUB and tagged PDF.

To get the XML output from a file, you need to add

\setupbackend[export=yes]

As an example, consider a simple file with some figures, math, and lists.

\setupbackend[export=yes]
\setuppapersize[A5]
\starttext

\startsection[title={Sample Section}]

  \startplacefigure
      [location=right, title={A sample figure}]
      \externalfigure[cow][width=2cm]
  \stopplacefigure

  \input knuth

  \placeformula[eq:1]
  \startformula
    E = mc^2 
  \stopformula

  Einstein gave the expression~(\in[eq:1]).

  \startitemize[n]
    \startitem
      First point
    \stopitem

    \startitem
      Second point
    \stopitem
  \stopitemize
\stopsection

\stoptext

which generates the usual PDF output.

In addition, it generates the following XML file, \jobname.export (note that all the structural information is retained and math is exported to MathML):

<?xml version='1.0' encoding='UTF-8' standalone='yes' ?>

<!-- input filename   : test              -->
<!-- processing date  : Tue Dec  4 00:21:55 2012 -->
<!-- context version  : 2012.11.16 23:51  -->
<!-- exporter version : 0.30              -->


<document language="en" file="test" date="Tue Dec  4 00:21:55 2012" context="2012.11.16 23:51" version="0.30" xmlns:m="http://www.w3.org/1998/Math/MathML">
  <section detail="section" location='aut:1'>
    <sectionnumber>1</sectionnumber> 
    <sectiontitle>Sample Section</sectiontitle> 
    <sectioncontent>
      <float detail="figure" location='aut:2'>
        <floatcontent><image name="cow" id='image-1' width='2.000cm' height='1.455cm'></image></floatcontent>
        <floatcaption><floatlabel detail="figure">Figure </floatlabel><floatnumber detail="figure">1</floatnumber> <floattext>A sample figure</floattext></floatcaption>
      </float>
Thus, I came to the conclusion that the designer of a new system must not only be the implementer and first large--scale user; the designer should also write the first user manual.
      <break/>
The separation of any of these four components would have hurt TEX significantly. If I had not participated fully in all these activities, literally hundreds of improvements would never have been made, because I would never have thought of them or perceived why they were important.
      <break/>
But a system cannot be successful if it is too strongly influenced by a single person. Once the initial design is complete and fairly robust, the real test begins as people with many different viewpoints undertake their own experiments.
      <formula>
        <formulacontent>
          <m:math display="block">
            <m:mrow>
              <m:mi>𝐸</m:mi>
              <m:mo>=</m:mo>
              <m:mi>𝑚</m:mi>
              <m:msup>
                <m:mi>𝑐</m:mi>
                <m:mn>2</m:mn>
              </m:msup>
            </m:mrow>
          </m:math>
        </formulacontent>
        <formulacaption>(<formulanumber detail="formula">1</formulanumber>)</formulacaption> 
      </formula>
Einstein gave the expression&#xA0;(1). 
      <itemgroup detail="itemize" symbol="n">
        <item>
          <itemtag>1.</itemtag>
          <itemcontent>First point</itemcontent>
        </item>
        <item>
          <itemtag>2.</itemtag>
          <itemcontent>Second point</itemcontent>
        </item>
      </itemgroup>
    </sectioncontent>
  </section>
</document>
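
Since the export is plain XML, it can be post-processed with any XML library. The following is a minimal sketch in Python using the standard `xml.etree.ElementTree` module. It is not part of ConTeXt; the `EXPORT` string is a trimmed and simplified copy of the file above (plain letters instead of the math italic characters), and the helper name `summarize` is made up for illustration. In practice one would read the export file written by ConTeXt instead of an inline string.

```python
import xml.etree.ElementTree as ET

# Trimmed, simplified copy of the export shown above (an assumption for the
# sake of a self-contained example; normally this would be read from disk).
EXPORT = """<?xml version='1.0' encoding='UTF-8' standalone='yes' ?>
<document language="en" file="test" xmlns:m="http://www.w3.org/1998/Math/MathML">
  <section detail="section">
    <sectionnumber>1</sectionnumber>
    <sectiontitle>Sample Section</sectiontitle>
    <sectioncontent>
Einstein gave the expression&#xA0;(1).
      <formula>
        <formulacontent>
          <m:math display="block">
            <m:mrow>
              <m:mi>E</m:mi><m:mo>=</m:mo><m:mi>m</m:mi>
              <m:msup><m:mi>c</m:mi><m:mn>2</m:mn></m:msup>
            </m:mrow>
          </m:math>
        </formulacontent>
      </formula>
      <itemgroup detail="itemize" symbol="n">
        <item><itemtag>1.</itemtag><itemcontent>First point</itemcontent></item>
        <item><itemtag>2.</itemtag><itemcontent>Second point</itemcontent></item>
      </itemgroup>
    </sectioncontent>
  </section>
</document>
"""

# The MathML elements live in their own namespace; the structural elements
# (section, item, ...) are not namespaced in this export.
MATHML = {"m": "http://www.w3.org/1998/Math/MathML"}


def summarize(root):
    """Return section titles, (tag, content) pairs of list items,
    and the number of MathML formulas found under `root`."""
    titles = [t.text for t in root.iter("sectiontitle")]
    items = [(i.findtext("itemtag"), i.findtext("itemcontent"))
             for i in root.iter("item")]
    formulas = len(root.findall(".//m:math", MATHML))
    return titles, items, formulas


# Parse from bytes so the encoding declaration in the file is honoured.
root = ET.fromstring(EXPORT.encode("utf-8"))
titles, items, formulas = summarize(root)

# Running ("normal") text sits directly inside <sectioncontent>.
leading_text = root.find(".//sectioncontent").text.strip()

print(titles)
print(items)
print(formulas)
print(leading_text)
```

The point of the sketch is that both the tagged structure (titles, list items, formulas) and the running text end up in the same tree, so no extra \write calls are needed to capture ordinary paragraphs.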

Two auxiliary CSS files are also generated. The first, \jobname-style.css, contains CSS for the font setup and any ConTeXt-defined environments and colors:

/* styles for file test.export */

document {
    font-size  : 12pt !important ;
    max-width  : 300pt !important ;
    text-align : justify !important ;
    hyphens    : inherited !important ;
}

The second, \jobname-images.css, contains information about the images used in the TeX file:

/* images for file test.export */

image[id="image-1"] {
    display           : block ;
    background-image  : url(cow) ;
    background-size   : 100% auto ;
    background-repeat : no-repeat ;
    width             : 2.000cm ;
    height            : 1.455cm ;
}
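
If the image information is needed outside a browser, for example to locate the image files when converting the export further, the rules in this file can be read with a few lines of code. The following Python sketch assumes the CSS keeps exactly the simple `image[id="..."] { property : value ; }` layout shown above; it is a regular-expression shortcut rather than a real CSS parser, and the name `image_rules` is made up for illustration.

```python
import re

# Copy of the generated images CSS shown above.
IMAGES_CSS = """/* images for file test.export */

image[id="image-1"] {
    display           : block ;
    background-image  : url(cow) ;
    background-size   : 100% auto ;
    background-repeat : no-repeat ;
    width             : 2.000cm ;
    height            : 1.455cm ;
}
"""

# One rule: image[id="..."] { ...declarations... }
RULE = re.compile(r'image\[id="([^"]+)"\]\s*\{([^}]*)\}')
# One declaration: property : value ;
DECL = re.compile(r'([\w-]+)\s*:\s*([^;]+?)\s*;')


def image_rules(css):
    """Map each image id to a dict of its CSS declarations."""
    return {image_id: dict(DECL.findall(body))
            for image_id, body in RULE.findall(css)}


rules = image_rules(IMAGES_CSS)
print(rules["image-1"]["background-image"])
print(rules["image-1"]["width"], rules["image-1"]["height"])
```

The ids returned here correspond to the `id` attributes of the `<image>` elements in the export, so the two files can be joined on that key.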