[TeX/LaTeX] Cannot copy text from the simplest PDF file

copy/paste ligatures pdf

Consider the following code:

\documentclass{article}
\begin{document}
   We define multiplication by
   $$v_1x = 0,\quad v_2x = v_1,\quad v_1y = -v_1,\quad v_2y = v_1$$
\end{document}

which looks like this

[Image: the rendered PDF output]

It's as simple as it gets, but when I copy the content of the PDF I get:

We de??ne multiplication by
v1x = 0; v2x = v1; v1y = ??v1; v2y = v1
1

So there are 4 errors:

  1. The fi in define disappears
  2. , becomes ;
  3. The minus sign is not being copied properly
  4. Another 1 appears at the very end of the copied text

How can I prevent this from happening? I use Texmaker with MiKTeX 2.9 and pdfLaTeX.

Best Answer

Unicode mapping based on font encoding

The packages cmap and mmap add glyph-to-Unicode conversion information to the PDF file, based on the TeX font encoding in use. They hook into LaTeX's font loading mechanism and should therefore be loaded as early as possible, e.g.:

\RequirePackage{mmap}% (\usepackage does not work before \documentclass)
\documentclass{article}

mmap is used here because, AFAIK, it has better math support.
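
Applied to the document from the question, a minimal sketch could look like this. It adds nothing beyond the two lines above; the body is the question's text, with \[...\] used in place of $$...$$. How well the comma and the minus sign in the formula copy may still depend on the PDF viewer, so treat it as something to test rather than a guaranteed fix.

```latex
\RequirePackage{mmap}% loaded before \documentclass, so before any font is set up
\documentclass{article}
\begin{document}
   We define multiplication by
   \[ v_1x = 0,\quad v_2x = v_1,\quad v_1y = -v_1,\quad v_2y = v_1 \]
\end{document}
```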

Unicode mapping based on glyph name

An alternative is a feature of pdfTeX that adds the mapping to Unicode based on the name of the glyph in the font. Therefore it does not work for PK fonts, because they do not contain glyph names.

\pdfgentounicode=1 %    
\input{glyphtounicode}

Caution: The packages cmap and mmap cannot be used together with \pdfgentounicode. The result would be a duplicate entry in the font data dictionary, which the PDF specification does not allow:

Note: No two entries in the same dictionary should have the same key. If a key does appear more than once, its value is undefined.

Copy&paste then yields an unpredictable result that depends on the PDF viewer.
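
Put together, the glyph-name approach might look like this in a complete document. This is a sketch that assumes pdfLaTeX as the engine (the answer describes \pdfgentounicode as a pdfTeX feature) and fonts that carry glyph names. Note that neither cmap nor mmap is loaded.

```latex
\documentclass{article}
\pdfgentounicode=1 %     ask pdfTeX to write glyph-to-Unicode maps into the PDF
\input{glyphtounicode}%  table that maps glyph names to Unicode code points
\begin{document}
   We define multiplication by
   \[ v_1x = 0,\quad v_2x = v_1,\quad v_1y = -v_1,\quad v_2y = v_1 \]
\end{document}
```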

Font encoding

If you have accented characters or other special symbols, you should also consider using the T1 font encoding. LaTeX's default encoding is OT1, which supports only 7 bits (at most 128 glyphs). Accented characters are therefore constructed from separate glyphs, which is bad for copy&paste:

\usepackage[T1]{fontenc}

For this you should have the cm-super font bundle installed, which contains Type 1 versions of the EC fonts. Alternatively, use the more modern Latin Modern fonts, which descend from the CM/EC fonts:

\usepackage[T1]{fontenc}
\usepackage{lmodern}
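
To check the effect, a small test document such as the following can be compiled once with and once without the two package lines, and the words copied from each PDF. The sample words are purely illustrative, and the accents are written as macros so that the file does not depend on the input encoding.

```latex
\documentclass{article}
\usepackage[T1]{fontenc}% 8-bit encoding with ready-made accented glyphs
\usepackage{lmodern}%     Latin Modern, available as Type 1 fonts in T1
\begin{document}
% Under OT1 each accented letter below is built from two glyphs;
% under T1 each one is a single glyph.
caf\'e, na\"{\i}ve, \"Uber
\end{document}
```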