[TeX/LaTeX] How is a mathematical formula represented in PDF


If I compile a .tex file to a PDF, how will the PDF encode a mathematical formula such as the integral below? Is it a bitmap? If not, how is it encoded? And how can I extract the formula from a PDF file?
[image: a typeset integral]

Some more context: My goal is to train a neural net to output the LaTeX code when I give it a mathematical symbol like the integral above as input. The first step would be to find out how the symbol is represented in the PDF file, so that I can extract that part and use it as a label for the training data.


Best Answer

Essentially, in a PDF every letter (or run of letters) is positioned by coordinates. Even a normal word may be encoded as individual letters positioned to "look" like text, so as to take account of inter-letter kerns and the like.

Math is no different: the characters are just normal font characters positioned on the page at locations that TeX has determined.

PostScript uses the same rendering model as PDF but is a bit easier to read by eye. Taking Henri's example and running it through latex and dvips, the input

$\int_0^2 x^2 dx$

produces the following PostScript:

%%Page: 1 1
TeXDict begin 1 0 bop 639 457 a Fc(R)695 477 y Fb(2)678
553 y(0)746 524 y Fa(x)793 494 y Fb(2)830 524 y Fa(dx)p
eop end

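Here you can see the structure: strings are written in parentheses, for example (dx) for dx. Apart from that two-letter run, every character run is a single character, with the font and coordinates specified separately for each one. The integral sign itself shows up as (R), because TeX's math extension font keeps the text-style integral in the slot of the letter R (Fc is presumably that font).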
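As a rough illustration of how such positioned runs could be pulled out for training data, here is a minimal Python sketch that parses only the PostScript fragment shown above. It assumes the usual meaning of the short dvips operators from its tex.pro header: `x y a` moves to a position, `(s) p` shows a string at the current position, and `(s) x y y` shows a string and then moves to (x, y). The function name `extract_glyph_runs` and the tuple layout are made up for this example, and any other dvips operator is simply ignored.

```python
import re

# The page body from the dvips output above (the %%Page comment is omitted).
PS_FRAGMENT = """TeXDict begin 1 0 bop 639 457 a Fc(R)695 477 y Fb(2)678
553 y(0)746 524 y Fa(x)793 494 y Fb(2)830 524 y Fa(dx)p
eop end"""

# One token is either a (string), an integer, or an operator/font name.
TOKEN = re.compile(
    r"\((?P<string>[^()\\]*)\)|(?P<number>-?\d+)|(?P<name>[A-Za-z]+)"
)


def extract_glyph_runs(ps):
    """Return a list of (font, (x, y), text) for each shown string.

    Assumption: only the operators a, p and y are interpreted, with the
    semantics described in the lead-in; everything else is skipped.
    """
    runs = []
    numbers = []    # integers seen since the last operator
    pending = None  # most recent string literal, not yet shown
    font = None     # current font selector, e.g. "Fa"
    pos = None      # current position in dvips device units

    for match in TOKEN.finditer(ps):
        kind = match.lastgroup
        if kind == "string":
            pending = match.group("string")
        elif kind == "number":
            numbers.append(int(match.group("number")))
        else:
            name = match.group("name")
            if re.fullmatch(r"F[a-z]", name):
                font = name                        # font selection
            elif name == "a" and len(numbers) >= 2:
                pos = (numbers[-2], numbers[-1])   # absolute move
            elif name in ("p", "y") and pending is not None:
                runs.append((font, pos, pending))  # show at current position
                pending = None
                if name == "y" and len(numbers) >= 2:
                    pos = (numbers[-2], numbers[-1])  # move after showing
                else:
                    pos = None  # position after a plain show is not tracked
            numbers.clear()
    return runs


for font, pos, text in extract_glyph_runs(PS_FRAGMENT):
    print(font, pos, text)
```

On this fragment the sketch yields six runs: the integral sign, the two limits, x, its exponent, and dx. The y coordinate grows downwards, which is why the upper limit has the smaller y value. A real PDF content stream uses different operators (for example Tj and Td), so the same idea would need a PDF parser rather than this regular expression.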