[Tex/LaTex] What are good ways to make pdflatex output copy-and-pasteable

copy/paste font-encodings languages pdftex

One frequent aspect of pdflatex output that could be improved is to what extent one can select text output from the generated pdf-file and copy-and-paste it. This is essentially an issue of having the right output encoding. There are at least two important aspects to this:

In the ideal case, Unicode would be the output encoding.
Ligature glyphs (such as ﬁ) are ideally decomposed into their component letters in the output (here: "fi").

It seems that output encodings are specified through the fontenc package. What is the range of possible output encodings, and most importantly: Is there a way to specify a Unicode output encoding and one that deals with ligatures in the intended way?

(Note: It is important to distinguish output from input encodings. How to use different input encodings in LaTeX has been documented widely. My understanding is that this is best handled by the inputenx package, as it supersedes the inputenc package.)

Quick guide to the solutions: The information in the answers and answer threads is somewhat scattered, so here is a short summary: one approach is to use \input glyphtounicode together with \pdfgentounicode=1; the other approach is to use cmap/mmap.
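As a rough sketch of the first approach, the preamble below loads the glyph-name-to-Unicode table and switches on ToUnicode generation. It assumes compilation with pdflatex; the T1 encoding, the lmodern package (chosen here only to get Type 1 fonts) and the sample words are illustrative assumptions, not part of the summary above.

```latex
\documentclass{article}
\usepackage[T1]{fontenc}   % assumption: T1-encoded fonts
\usepackage{lmodern}       % assumption: any Type 1 font family would do
\input{glyphtounicode}     % table mapping glyph names to Unicode code points
\pdfgentounicode=1         % ask pdfTeX to write ToUnicode CMaps into the PDF
\begin{document}
% Words containing fi, ffi and fl ligatures, for a copy-and-paste test
finden, Schifffahrt, Auflage
\end{document}
```

After compiling, selecting the text in a PDF viewer and pasting it into an editor shows whether the ligatures come back as separate letters.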

Addendum: Sometimes one will need to load the package accsupp and enclose one's macro definition in \BeginAccSupp{method=hex,unicode,ActualText=<codepoint>}[…]\EndAccSupp{} to generate a specific code point. See the caveat about non-BMP code points here and the fix (starting from accsupp version v0.4, 2012/11/18) here, which provides the new unichar package option.
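A minimal sketch of such a macro definition might look as follows. The macro name \N and the choice of U+2115 (double-struck capital N) are assumptions made for illustration only; whether the ActualText entry is honoured on copy and paste also depends on the PDF viewer.

```latex
\documentclass{article}
\usepackage{amssymb}   % provides \mathbb
\usepackage{accsupp}
% Hypothetical macro: prints a blackboard-bold N and tags it with U+2115
\newcommand*{\N}{%
  \BeginAccSupp{method=hex,unicode,ActualText=2115}%
  \mathbb{N}%
  \EndAccSupp{}%
}
\begin{document}
Let $n \in \N$.
\end{document}
```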

Best Answer

Do you know the cmap package (see CTAN:cmap)? It does this for you. Load it with \usepackage{cmap} as the first package.
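For illustration, a minimal document with this loading order could look like the sketch below. The T1 encoding and the sample words are assumptions chosen only to have some ligatures to test.

```latex
\documentclass{article}
\usepackage{cmap}          % first package, before any font encoding is set up
\usepackage[T1]{fontenc}   % assumption: T1 as the font encoding
\begin{document}
finden, Schifffahrt
\end{document}
```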

Update: I did a little research and found some hints on how to use cmap or mmap. Result: cmap and mmap cannot handle fonts based on virtual fonts (*.vf or *.vpl files in the font directory). So if the font you use relies on virtual fonts, cmap and mmap will not work, and you should use glyphtounicode instead. This hint could be useful.

BTW: cmap is based on the character maps of Adobe. See PDF reference Version 1.6 for more information.

Update 2: Running texdoc encguide opens a document describing the T1 (Cork) encoding. It contains a table showing all the glyphs that can be written directly into your PDF file. Because T1 has a glyph ä, you can search for ä in your PDF file. If the encoding has no glyph ä, LaTeX has to build it from two characters: " and a. You will still see ä in your PDF file, but you cannot search for it (searching for "a may help, and copy and paste gives you that back).
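The difference can be tried out with a small test file along the following lines. This is only a sketch: the utf8 input encoding and the sample word are assumptions, and the exact text a viewer returns for the composed accent may vary.

```latex
\documentclass{article}
\usepackage[utf8]{inputenc}
% With T1, ä is a single glyph of the font.
% Comment out the next line to fall back to OT1, where ä is
% assembled from an accent and the letter a.
\usepackage[T1]{fontenc}
\begin{document}
Kapitän
\end{document}
```

Compiling both variants and searching each PDF for "Kapitän" shows the effect described above.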

Update 3: glyphtounicode supports only the T1 encoding, whereas cmap supports many more encodings. One advantage of glyphtounicode is that a single command is enough to add a glyph that has not yet been included in the official list. More information about \pdfglyphtounicode is here: tb91thanh-fonts.pdf. Make sure to get the latest version of glyphtounicode.tex from LCDF Type Software and see this caveat on how to properly use it.
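Adding such a mapping might look like the sketch below. The glyph name f_f is an assumption used for illustration; the actual name has to be looked up in the encoding or AFM file of the font in question. The code points are given as hexadecimal values, here two of them so that the ligature is copied as two separate letters.

```latex
\documentclass{article}
\usepackage[T1]{fontenc}
\input{glyphtounicode}
% Hypothetical glyph name f_f, mapped to the two letters "ff" (U+0066 U+0066)
\pdfglyphtounicode{f_f}{0066 0066}
\pdfgentounicode=1
\begin{document}
auffinden
\end{document}
```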

Related Solutions

[Tex/LaTex] Proper use of cmap and mmap

The mmap package does a little more than cmap: it also handles mathematical symbols in your PDF.

So if your PDF contains no mathematics, use \usepackage{cmap}. If you still have problems with ligatures in Computer Modern, use \usepackage[resetfonts]{cmap}. If the document contains mathematical symbols, use \usepackage{mmap}. If problems remain, use \usepackage[noTeX]{mmap}.

The differences are:

\usepackage{cmap}: accepts preloaded fonts without reloading them.
\usepackage[resetfonts]{cmap}: as the cmap README explains, this forces preloaded fonts (Computer Modern) to be reloaded.
\usepackage[useTeX]{mmap} and \usepackage{mmap}: do everything cmap does and additionally correct mathematical symbols in your PDF; they use the new -m.cmap files ("uses ascii strings for the macro-names").
\usepackage[noTeX]{mmap}: does everything cmap does and additionally corrects mathematical symbols in your PDF; it uses the cmap files (Unicode).

Load cmap or mmap first, then fontenc and babel.

The documentation of fixltx2e only says "load in the preamble". I had no problems loading it after fontenc, babel and the fonts used.

To run your own experiments, use the following MWE:

\listfiles                      % shows used files
\documentclass[12pt]{scrartcl}
%\usepackage{cmap}              % pure T1 fonts 
%\usepackage[resetfonts]{cmap}  % pure T1 fonts, reset CM
%\usepackage{mmap}              % cmap + mathematics (ASCII)
%\usepackage[noTeX]{mmap}       % cmap + mathematics (Unicode)

 \usepackage[latin9]{inputenc}  % or utf8
 \usepackage[T1]{fontenc}       % font encoding
%\usepackage[T3,T1]{fontenc}    % T3 for package tipa
%\usepackage{tipa}              % Phonetic alphabet
 \usepackage[ngerman]{babel}    % new German orthography

%\usepackage{lmodern}           % Latin Modern
%\usepackage{tgpagella}         % has no virtual fonts
%\usepackage[osf]{mathpazo}     % old-style figures work

%\usepackage{libertine}         % Libertine Legacy (with virtual fonts)
 \usepackage[osf]{libertine}    % with old-style (lowercase) figures

\newcommand*{\III}{\libertineGlyph{Threeroman}}
\newcommand*{\IV}{\libertineGlyph{Fourroman}}


\begin{document}

Römische Zahlen: \III, \IV.

\textsc{Ligaturen}: auffliegen auffinden finden Auflage Schifffahrt.

\textsc{Korrekt}: auf\/fliegen auf\/finden finden Auf\/lage Schiff\/fahrt.

Ziffern: 0123456789.

Donau Donaudampfschiff Donaudampfschifffahrt Donaudampfschifffahrtskapitän 
Donaudampfschifffahrtskapitän 
Donaudampfschifffahrtskapitän Donaudampfschifffahrtskapitän 
Donaudampfschifffahrtskapitän Donaudampfschifffahrtskapitän

%\textipa{[\!b] [\:r] [\;B]}

\end{document}

Add or remove the comment signs to test cmap and mmap with or without fontenc and with different fonts.

BTW: "Donaudampfschifffahrtskapitän" is a German word that is useful for testing hyphenation.

[Tex/LaTex] Migrating from pdfTeX to LuaTeX: Problems with reproducing output for legacy projects

First of all, in my opinion you should use utf8 instead of utf8x. utf8x is unmaintained and has problems, for example with biblatex. (You will have to set up some missing definitions for pdflatex.) You will also have to add some definitions for lualatex, because, as you already found out, it simply maps undeclared characters to their Unicode position. Here, for example, are two definitions for ½ and µ:

\documentclass{article}
\usepackage[utf8]{luainputenc}
\usepackage[T1]{fontenc}
\usepackage{textcomp}
\renewcommand{\rmdefault}{lmr}
\DeclareUnicodeCharacter{00BD}{\textonehalf}
\DeclareUnicodeCharacter{00B5}{\textmu}
\begin{document}
\begin{tabular}{@{}l*{10}{p{7mm}@{}}}
Some T1 characters:     & \# & \$ & \% & Ă & Ň & § & @ & Æ & ß   & £   \\[1.5mm]
Some non-T1 characters: & ‡  & ÿ  & ‰  & … & ¶ & ½ & µ %ĩ &  & | | & | | \\
\end{tabular}
\end{document}
