[TeX/LaTeX] What are good ways to make pdflatex output copy-and-pasteable

copy/paste, font-encodings, languages, pdftex

One aspect of pdflatex output that could often be improved is how well text can be selected in the generated PDF file and then copied and pasted elsewhere. This is essentially a matter of having the right output encoding. There are at least two important aspects to this:

  • In the ideal case, Unicode would be the output encoding.
  • Ligature glyphs (such as ﬁ) are ideally decomposed into their component letters in the output (here: "fi", that is, f followed by i).

It seems that output encodings are specified through the fontenc package. What is the range of possible output encodings, and most importantly: Is there a way to specify a Unicode output encoding and one that deals with ligatures in the intended way?

(Note: It is important to distinguish output from input encodings. How to use different input encodings in LaTeX has been documented widely. My understanding is that this is best handled by the inputenx package, as it supersedes the inputenc package.)


Quick guide to the solutions: The information in the answers and comment threads is somewhat scattered, so here is a short summary: one approach is to use \input glyphtounicode together with \pdfgentounicode=1; the other approach is to use cmap/mmap.
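As a minimal sketch of the first approach, assuming pdflatex, a T1-encoded font, and UTF-8 input (the document class and sample text are only illustrative):

```latex
\documentclass{article}
\usepackage[T1]{fontenc}
\usepackage[utf8]{inputenc}

% Load the table that maps glyph names to Unicode code points ...
\input glyphtounicode
% ... and ask pdfTeX to write ToUnicode CMaps for the fonts it embeds.
\pdfgentounicode=1

\begin{document}
An efficient office: the ligatures should copy as plain letters.
\end{document}
```

Copying "efficient" from the resulting PDF should then give plain letters rather than ligature code points, although the exact result also depends on the PDF viewer.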

Addendum: Sometimes one will need to load the package accsupp and enclose one's macro definition in \BeginAccSupp{method=hex,unicode,ActualText=<codepoint>}[…]\EndAccSupp{} to generate a specific code point. See the caveat about non-BMP code points here and the fix (starting from accsupp version v0.4, 2012/11/18) here, which provides the new unichar package option.
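A sketch of how such a macro might look, assuming pdflatex; the macro name \approxsign is hypothetical and chosen only for illustration, and the code point 2248 (ALMOST EQUAL TO) is just an example:

```latex
\documentclass{article}
\usepackage{accsupp}

% \approxsign is a hypothetical macro name used only in this example.
% With method=hex and unicode, ActualText is the code point written
% as hexadecimal UTF-16BE, here U+2248.
\newcommand{\approxsign}{%
  \BeginAccSupp{method=hex,unicode,ActualText=2248}%
  \ensuremath{\approx}%
  \EndAccSupp{}%
}

\begin{document}
The value is \approxsign\ 3.14.
\end{document}
```

Whether the ActualText is honoured when copying depends on the PDF viewer.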

Best Answer

Do you know the package cmap (look at CTAN:cmap)? It does this for you. Load it with \usepackage{cmap} as the first package.
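A minimal sketch of a preamble with cmap loaded first, assuming pdflatex, T1 fonts, and UTF-8 input (the sample text is only illustrative):

```latex
\documentclass{article}
\usepackage{cmap}            % first package, before fontenc
\usepackage[T1]{fontenc}
\usepackage[utf8]{inputenc}

\begin{document}
Größe, naïve, efficient
\end{document}
```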

Update: I did a little research and found some hints on how to use cmap or mmap. Result: cmap and mmap cannot handle fonts based on virtual fonts (files *.vf or *.vpl in the font directory). So if the font you use needs virtual fonts, cmap and mmap will not work, and you are better off using glyphtounicode. This hint could be useful.

BTW: cmap is based on the character maps of Adobe. See PDF reference Version 1.6 for more information.

Update 2: Running texdoc encguide gives you a document describing the T1 (Cork) encoding. It contains a table of all the glyphs that can be written directly into your PDF file. Because T1 has a glyph ä, you can search for ä in your PDF file. If the encoding has no glyph ä, LaTeX has to build it from two characters: "a. You will still see ä in your PDF file, but you cannot search for it (searching for "a may help, and that is also what copy and paste gives you back).
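The difference can be tried out with a small test file. This is only a sketch, assuming pdflatex and UTF-8 input: without the fontenc line, the default OT1 encoding has no ä glyph and the letter is assembled from an accent and a; with the line uncommented, T1 supplies a single ä glyph.

```latex
\documentclass{article}
\usepackage[utf8]{inputenc}
% Uncomment the next line to switch from OT1 (composed ä)
% to T1 (single precomposed ä glyph):
% \usepackage[T1]{fontenc}

\begin{document}
Käse
\end{document}
```

Compile both variants and try searching for "Käse" in each PDF to compare.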

Update 3: glyphtounicode supports only the T1 encoding, whereas cmap supports many more encodings. One advantage of glyphtounicode is that a single command is enough to add a glyph that has not yet been included in the official list. More information about \glyphtounicode is here: tb91thanh-fonts.pdf. Make sure to get the latest version of glyphtounicode.tex from LCDF Type Software and see this caveat on how to properly use it.
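As a sketch of adding one extra mapping: in pdfTeX the primitive is spelled \pdfglyphtounicode and takes a glyph name and a hexadecimal code point. The glyph name myspade below is hypothetical; a real entry has to use the name that the font's encoding file gives the glyph.

```latex
\documentclass{article}
\usepackage[T1]{fontenc}

\input glyphtounicode
% Hypothetical extra entry: map a glyph named "myspade" to U+2660.
\pdfglyphtounicode{myspade}{2660}
\pdfgentounicode=1

\begin{document}
Text set in a font that contains the extra glyph.
\end{document}
```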