If I compile a .tex file to a PDF, how will the PDF encode a mathematical formula such as the integral below? Is it a bitmap, or is it encoded some other way? And how can I extract the formula from the PDF file?

Some more context: my goal is to train a neural net to output the LaTeX code when I give it a mathematical symbol, like the integral above, as input. The first step would be to find out how the symbol is represented in the PDF file, so that I can extract that part and use it as a label for the training data.

Thanks

## Best Answer

Essentially, in PDF every letter (or run of letters) is positioned by coordinates, so even a normal word might be encoded as individual letters positioned to "look" like text, in order to take account of inter-letter kerns etc.

Math is no different: the characters are just normal font characters positioned on the page at locations that TeX has determined.

PostScript uses the same rendering model as PDF but is a bit easier to read by eye. Taking Henri's example and using latex and dvips

Produces the following PostScript

where you can see the structure: strings are encoded as, for example,

`(dx)`

for dx, but apart from that two-letter example all other character runs are single characters, with the font and coordinates specified separately for each letter.
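To make the "font plus coordinates plus short string" structure concrete for the extraction side of the question, here is a minimal Python sketch that walks a simplified, uncompressed PDF-style content stream and lists each character run with its font and position. The sample stream, the font names `/F1` and `/F2`, the coordinates, and the function name `extract_runs` are all made up for illustration and are not real TeX output. It also only understands the `BT`, `Tf`, `Td` and `Tj` operators, so treat it as a sketch under those assumptions rather than a working PDF parser; a real file usually has compressed streams, more operators, and font-specific character codes that still have to be mapped back to symbols.

```python
import re

# Hypothetical, hand-written content stream (NOT actual TeX output).
# Real streams are usually compressed and use more operators (Tm, TJ, ...).
SAMPLE_STREAM = """
BT
/F1 10 Tf
100 700 Td
(Z) Tj
/F2 10 Tf
14 -6 Td
(f) Tj
8 0 Td
(dx) Tj
ET
"""

TOKEN = re.compile(
    r"\((?:\\.|[^\\()])*\)"       # literal string, e.g. (dx)
    r"|/[^\s()/\[\]<>]+"          # name, e.g. /F1
    r"|[-+]?(?:\d+\.?\d*|\.\d+)"  # number, e.g. 100 or -6
    r"|[A-Za-z]+"                 # operator, e.g. Tf, Td, Tj
)


def extract_runs(stream):
    """Return (font, size, x, y, text) for every Tj in the stream.

    x, y is the start of the current text line as moved by Td. Glyph
    advances after a Tj are ignored, because they need font metrics.
    """
    runs = []
    operands = []
    font, size = None, None
    x = y = 0.0
    for tok in TOKEN.findall(stream):
        if tok[0] == "(":
            # Strip the parentheses; only simple backslash escapes are
            # undone here (octal escapes and \n are not handled).
            operands.append(re.sub(r"\\(.)", r"\1", tok[1:-1]))
        elif tok[0] == "/":
            operands.append(tok)
        elif tok[0] in "+-.0123456789":
            operands.append(float(tok))
        else:
            if tok == "BT":        # begin text object: reset position
                x = y = 0.0
            elif tok == "Tf":      # select font and size
                font, size = operands[-2], operands[-1]
            elif tok == "Td":      # move relative to the line start
                x += operands[-2]
                y += operands[-1]
            elif tok == "Tj":      # show a string
                runs.append((font, size, x, y, operands[-1]))
            operands.clear()       # operators consume their operands
    return runs


if __name__ == "__main__":
    for font, size, x, y, text in extract_runs(SAMPLE_STREAM):
        print(f"{font} @ {size}pt  ({x}, {y})  {text!r}")
```

Each tuple it returns is one positioned run, which is roughly the raw material you would have to group by position and map through the font encodings before it could serve as a training label.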