[TeX/LaTeX] Encoding to be able to search the PDF

copy/paste, font-encodings, input-encodings

I am writing a document in Czech. I want

  1. To be able to fully search the PDF. At the moment I can search only for words that contain no Czech-specific characters, but I want to be able to search for words with Czech characters as well.
  2. To be able to copy & paste from the PDF. At the moment, when I copy the text and paste it into Notepad, words such as "společnost" are pasted as "spolecˇnost".

My MWE:

\documentclass[11pt,a4paper]{article}
\usepackage[czech]{babel}
\usepackage[X]{inputenc}
\usepackage[Y]{fontenc}
\begin{document}
Zvyšuje se nebezpečí, že skupina pomatených lidí na společnosti napáchá obrovské škody.
\end{document}

What should go instead of X and Y? Thank you.

Best Answer

glyphtounicode and cmap
The choice of X and Y is probably not the problem.

According to page 7 of the MinionPro manual, to make figures and ligatures searchable you need to enable glyph-to-Unicode translation and load the default mapping table:

\input{glyphtounicode}
\pdfgentounicode=1

glyphtounicode was included in my MiKTeX distribution, but if it is not included in yours, you can find it at Sarovar.

In principle, this solution works with any font whose glyph names appear in the mapping table.

If you are using Computer Modern as your font, you may try adding:

\usepackage{lmodern}
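Put together with the MWE from the question, the pieces above could look roughly like this. It is only a minimal sketch, assuming pdfLaTeX and a source file saved as UTF-8; using `utf8` for X and `T1` for Y is my assumption (the usual pairing), not something I have verified for every font.

```latex
\documentclass[11pt,a4paper]{article}
% Load the default glyph-name-to-Unicode table and switch the mapping on.
% Both lines are pdfTeX-specific.
\input{glyphtounicode}
\pdfgentounicode=1
\usepackage[czech]{babel}
\usepackage[utf8]{inputenc} % assumption: the .tex file is saved as UTF-8
\usepackage[T1]{fontenc}    % assumption: T1 supplies precomposed accented letters
\usepackage{lmodern}        % Latin Modern in place of Computer Modern
\begin{document}
Zvyšuje se nebezpečí, že skupina pomatených lidí na společnosti napáchá obrovské škody.
\end{document}
```

Whether a search for "společnost" then succeeds still depends on the font and on the PDF viewer, so it is worth testing search and copy & paste on the resulting file.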

I also tried cmap, but it still did not make the special glyphs searchable. In addition, I tried the TeX Gyre font Termes (similar to Times):

\usepackage{tgtermes}

With it the glyphs are still not searchable (although they should be).

I guess this is because the glyphs are not defined in the font, so TeX constructs them by combining two other glyphs. I do not have the skills to help you further, but maybe @egreg can: Copy Czech characters from PDF with charter font
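The difference between a composed and a precomposed glyph can be seen in a small test. This is a sketch, assuming pdfLaTeX with Latin Modern installed: it typesets the same word once in OT1, where `\v{c}` is built from a caron accent and a plain c, and once in T1, where č is a single glyph.

```latex
\documentclass{article}
\usepackage[OT1,T1]{fontenc} % load both encodings; T1 (listed last) is the default
\usepackage{lmodern}
\input{glyphtounicode}
\pdfgentounicode=1
\begin{document}
% OT1: the caron and the letter c end up as two separate glyphs in the PDF.
{\fontencoding{OT1}\selectfont spole\v{c}nost}

% T1: the accented letter is one glyph, which can be mapped to U+010D.
{\fontencoding{T1}\selectfont spole\v{c}nost}
\end{document}
```

Copying the first line from the PDF typically yields the accent and the letter as two separate characters, which matches the "spolecˇnost" symptom in the question.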

newtx and tgtermes
Regarding newtx, this MWE compiles on my system, but the Czech characters are not searchable:

\documentclass[final,oneside,a6paper,11pt,norsk,article]{memoir}
%\documentclass{standalone}
\usepackage{fixltx2e}
\usepackage{babel}
\usepackage[osf]{newtxtext}
\input{glyphtounicode}
\pdfgentounicode=1

\usepackage{lipsum}
\usepackage[utf8]{inputenx}
\usepackage[T1]{fontenc}

\begin{document}

Dette er en prøve på æøå AÅØ
Dette er en prøve på æøå AÅØDette er en prøve på æøå AÅØDette er en prøve på æøå AÅØDette er en prøve på æøå AÅØDette er en prøve på æøå AÅØDette er en prøve på æøå AÅØDette er en prøve på æøå AÅØDette er en prøve på æøå AÅØDette er en prøve på æøå AÅØ

vis-à-vis ñ ôö ä

1234567890

Zvyšuje se nebezpečí, že skupina pomatených lidí na společnosti napáchá obrovské škody.

\end{document}

You have to extend the glyphtounicode table with the Czech characters.

I have no time to look into that now, but perhaps some of the TeX wizards can help. I believe it is as simple as providing some commands of the form:

\pdfglyphtounicode{A}{0041}
\pdfglyphtounicode{AE}{00C6}
\pdfglyphtounicode{AEacute}{01FC}
\pdfglyphtounicode{AEmacron}{01E2}

Here the first parameter is the glyph name used in the font and the second is the corresponding Unicode code point in hexadecimal.
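For the Czech letters, such entries might look like the sketch below. The glyph names follow the Adobe naming convention and are an assumption about what the font actually uses. As far as I know, the stock glyphtounicode.tex already lists these standard names, so extra entries should only matter when a font uses different names, which would have to be looked up in the font's encoding file.

```latex
% Place after \input{glyphtounicode} in the preamble (pdfTeX only).
% The glyph names below are assumed; check the names used by your font.
\pdfglyphtounicode{aacute}{00E1} % á
\pdfglyphtounicode{ccaron}{010D} % č
\pdfglyphtounicode{dcaron}{010F} % ď
\pdfglyphtounicode{eacute}{00E9} % é
\pdfglyphtounicode{ecaron}{011B} % ě
\pdfglyphtounicode{iacute}{00ED} % í
\pdfglyphtounicode{ncaron}{0148} % ň
\pdfglyphtounicode{oacute}{00F3} % ó
\pdfglyphtounicode{rcaron}{0159} % ř
\pdfglyphtounicode{scaron}{0161} % š
\pdfglyphtounicode{tcaron}{0165} % ť
\pdfglyphtounicode{uacute}{00FA} % ú
\pdfglyphtounicode{uring}{016F}  % ů
\pdfglyphtounicode{yacute}{00FD} % ý
\pdfglyphtounicode{zcaron}{017E} % ž
```

The uppercase letters would need matching entries (for example `Ccaron` with `010C`), again assuming the font uses those names.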
