One aspect of `pdflatex` output that could often be improved is the extent to which text can be selected in the generated PDF file and copied and pasted. This is essentially a matter of having the right output encoding. There are at least two important aspects to this:
- In the ideal case, Unicode would be the output encoding.
- Ligature glyphs (such as ﬁ) would ideally be decomposed into their component letters in the copied text (here: "fi").
It seems that output encodings are specified through the `fontenc` package. What is the range of possible output encodings, and most importantly: is there a way to specify a Unicode output encoding, and one that deals with ligatures in the intended way?
(Note: It is important to distinguish output encodings from input encodings. How to use different input encodings in LaTeX has been documented widely. My understanding is that this is best handled by the `inputenx` package, as it supersedes the `inputenc` package.)
Quick guide to the solutions: The information in the answers and answer threads is somewhat scattered, so here is a short summary: one approach is to use `\input glyphtounicode` together with `\pdfgentounicode=1`; the other approach is to use `cmap`/`mmap`.
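As a minimal sketch of the first approach, assuming a document compiled with `pdflatex` and T1-encoded fonts (the sample sentence is only an illustration), the two lines go in the preamble. The second approach is indicated in a comment; `cmap` should be loaded before `fontenc`.

```latex
\documentclass{article}
% Alternative approach (use instead of the two pdfTeX lines below):
% \usepackage{cmap}   % must be loaded before fontenc
\usepackage[T1]{fontenc}

% Approach 1: load the glyph-name-to-Unicode table that ships with pdfTeX
% and ask pdfTeX to write ToUnicode maps into the PDF.
\input{glyphtounicode}
\pdfgentounicode=1

\begin{document}
% Words containing the usual f-ligatures, to test copy-and-paste.
An efficient office workflow: fi, fl, ff, ffi, ffl.
\end{document}
```

To check the result, copy the sentence from the PDF viewer into a plain-text editor and see whether the ligatures come back as separate letters. The outcome can depend on the font and on the viewer.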
Addendum: Sometimes one will need to load the `accsupp` package and enclose one's macro definition in `\BeginAccSupp{method=hex,unicode,ActualText=<codepoint>}` […] `\EndAccSupp{}` to generate a specific code point. See the caveat about non-BMP code points here and the fix (starting from `accsupp` version v0.4, 2012/11/18) here, which provides the new `unichar` package option.
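A minimal sketch of the `accsupp` approach follows. The macro name `\copyableinfty` is hypothetical and chosen only for illustration; the example assumes that the infinity sign should be copied as U+221E, given as a hexadecimal code point in `ActualText`.

```latex
\documentclass{article}
\usepackage{accsupp}

% Hypothetical macro: typesets the infinity sign and attaches
% ActualText U+221E (INFINITY), so that copying yields that code point.
\newcommand*{\copyableinfty}{%
  \BeginAccSupp{method=hex,unicode,ActualText=221E}%
    \ensuremath{\infty}%
  \EndAccSupp{}%
}

\begin{document}
The limit is \copyableinfty.
\end{document}
```

Whether the replacement text is honoured on copy-and-paste depends on the PDF viewer, so it is worth testing with more than one.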
Best Answer
Do you know the package `cmap` (look at CTAN:cmap)? It does this for you. Load it with `\usepackage{cmap}` as the first package.

Update: I did a little research and found some hints on how to use `cmap` or `mmap`. Result: `cmap` and `mmap` can't handle fonts based on virtual fonts (files `*.vf` or `*.vpl` in the font directory). So if the font you use needs virtual fonts, `cmap` or `mmap` can't work, and you had better use `glyphtounicode`. This hint could be useful.

BTW: `cmap` is based on the character maps of Adobe. See the PDF Reference, version 1.6, for more information.

Update 2: With `texdoc encguide` you will get a document describing the T1 (Cork) encoding. There you will find a table showing all the glyphs you can write directly into your PDF file. Because there is a glyph `ä`, you can find `ä` in your PDF file. If there were no glyph `ä`, LaTeX would have to build it from two characters: `"a`. You would see `ä` in your PDF file but could not find it by searching (searching for `"a` could help; copy and paste gives you this back).

Update 3: `glyphtounicode` supports only the T1 encoding; `cmap` supports many more encodings. One advantage of `glyphtounicode` is that one can add, with a single command, more glyphs that haven't yet been included in the official list. More information about `\glyphtounicode` is here: tb91thanh-fonts.pdf. Make sure to get the latest version of `glyphtounicode.tex` from LCDF Type Software and see this caveat on how to properly use it.
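The single command mentioned in Update 3 is presumably the pdfTeX primitive `\pdfglyphtounicode`, which takes a glyph name and a hexadecimal code point. The sketch below assumes that the font in use contains a glyph named `therefore`. That glyph name is only an illustration and has to match the name used in the font's encoding file.

```latex
\documentclass{article}
\usepackage[T1]{fontenc}

% Load the standard table first, then add or override single entries.
\input{glyphtounicode}
% Assumption: the font has a glyph named "therefore";
% map it to U+2234 (hex digits, no "U+" prefix).
\pdfglyphtounicode{therefore}{2234}
\pdfgentounicode=1

\begin{document}
Text set in a font that provides the extra glyph.
\end{document}
```

Placing the extra mapping after `\input{glyphtounicode}` keeps the official table intact and makes the addition easy to find later.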