[Tex/LaTex] Encoding to be able to search pdf

copy/paste, font-encodings, input-encodings

I am writing a document in Czech and want two things:

To be able to search the PDF fully. At the moment I can only search for words that contain no Czech characters, but I want to search for words with Czech characters as well.
To be able to copy and paste from the PDF. At the moment, when I copy text and paste it into Notepad, words such as "společnost" are pasted as "spolecˇnost".

My MWE:

\documentclass[11pt,a4paper]{article}
\usepackage[czech]{babel}
\usepackage[X]{inputenc}
\usepackage[Y]{fontenc}
\begin{document}
Zvyšuje se nebezpečí, že skupina pomatených lidí na společnosti napáchá obrovské škody.
\end{document}

What should go instead of X and Y? Thank you.

Best Answer

glyphtounicode and cmap
The choice of X and Y is probably not the problem.

According to page 7 of the MinionPro manual, to make figures and ligatures searchable you need to enable glyph-to-Unicode translation and load the default mapping table:

\input{glyphtounicode}
\pdfgentounicode=1

glyphtounicode was included in my MiKTeX distribution; if it is not included in yours, you can find it at Sarovar.

This approach is not tied to one font: it should work with any font whose glyph names appear in the mapping table.

If you are using Computer Modern as your font, you may also try adding:

\usepackage{lmodern}

I also tried cmap, but it still did not make the special glyphs searchable. In addition, I tried the TeX Gyre font Termes (similar to Times):

\usepackage{tgtermes}

The glyphs were not searchable (although they should have been).

My guess is that the glyphs are not defined in the font, so TeX constructs them by combining two other glyphs. I do not have the skills to help you further, but maybe @egreg can: Copy Czech characters from PDF with charter font
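
For reference, here is a minimal sketch of a preamble that avoids constructed accents. It assumes pdfLaTeX and a source file saved as UTF-8, neither of which is stated in the question. The T1 encoding has a slot for every Czech letter, and Latin Modern provides those slots as real glyphs, so "č" is a single glyph rather than "c" plus a separate caron:

```latex
\documentclass[11pt,a4paper]{article}
\usepackage[T1]{fontenc}      % 8-bit encoding with precomposed accented letters
\usepackage[utf8]{inputenc}   % assumption: the .tex file is saved as UTF-8
\usepackage[czech]{babel}
\usepackage{lmodern}          % vector font that covers T1
\input{glyphtounicode}        % pdfTeX only: load the default glyph-name table
\pdfgentounicode=1            % write ToUnicode maps into the PDF
\begin{document}
Zvyšuje se nebezpečí, že skupina pomatených lidí na společnosti napáchá obrovské škody.
\end{document}
```

Whether search and copy then work can still depend on the PDF viewer, so it is worth testing with the viewer you actually use.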

newtx and tgtermes
Regarding newtx, this MWE compiles on my system, but the Czech characters are not searchable:

\documentclass[final,oneside,a6paper,11pt,norsk,article]{memoir}
%\documentclass{standalone}
\usepackage{fixltx2e}
\usepackage{babel}
\usepackage[osf]{newtxtext}
\input{glyphtounicode}
\pdfgentounicode=1

\usepackage{lipsum}
\usepackage[utf8]{inputenx}
\usepackage[T1]{fontenc}

\begin{document}

Dette er en prøve på æøå AÅØ
Dette er en prøve på æøå AÅØDette er en prøve på æøå AÅØDette er en prøve på æøå AÅØDette er en prøve på æøå AÅØDette er en prøve på æøå AÅØDette er en prøve på æøå AÅØDette er en prøve på æøå AÅØDette er en prøve på æøå AÅØDette er en prøve på æøå AÅØ

vis-à-vis ñ ôö ä

1234567890

Zvyšuje se nebezpečí, že skupina pomatených lidí na společnosti napáchá obrovské škody.

\end{document}

You have to extend the glyphtounicode table with the Czech characters.

I have no time to look into that now, but perhaps one of the TeX wizards can help. I believe it is as simple as providing some commands of the form:

\pdfglyphtounicode{A}{0041}
\pdfglyphtounicode{AE}{00C6}
\pdfglyphtounicode{AEacute}{01FC}
\pdfglyphtounicode{AEmacron}{01E2}

where the first argument is the glyph name used in the font and the second is the corresponding Unicode code point in hexadecimal.
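
As a hedged illustration for Czech, the lines below follow the same pattern. The glyph names are the usual Adobe-style names and are an assumption about what a given font calls these glyphs. The default glyphtounicode table may well contain them already, in which case only fonts with nonstandard glyph names need extra entries:

```latex
% Place after \input{glyphtounicode}; these commands require pdfTeX.
% Glyph names are assumed to follow the Adobe naming convention.
\pdfglyphtounicode{ccaron}{010D} % č
\pdfglyphtounicode{dcaron}{010F} % ď
\pdfglyphtounicode{ecaron}{011B} % ě
\pdfglyphtounicode{ncaron}{0148} % ň
\pdfglyphtounicode{rcaron}{0159} % ř
\pdfglyphtounicode{scaron}{0161} % š
\pdfglyphtounicode{tcaron}{0165} % ť
\pdfglyphtounicode{uring}{016F}  % ů
\pdfglyphtounicode{zcaron}{017E} % ž
\pdfgentounicode=1
```

The uppercase letters would need matching entries (for example Ccaron with 010C). If a font uses different internal names, those names go in the first argument instead.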


Related Solutions

[Tex/LaTex] Using inputenc package for cp1252 encoding

The "... unavailable in encoding OT1" error means, as Herbert said, that your problem is related to your font encoding and can be solved by loading the appropriate font encoding.

Since you said you are new to LaTeX, you may also want to read "What packages do people load by default?", where you can learn that if you load the T1 font encoding, you should usually also load a vector font, for example lmodern, and probably the babel package (the order does not matter).

You could also consider using utf8 as the input encoding instead of cp1252: most editors support utf8 today, and at some point your input may contain characters that are not available in cp1252. You can just as well switch at any later time.

Later on you may want references; in that case, take a look at biblatex and biber.

Putting all these hints together, you get an MWE that looks like this (to answer your question alone, the four lines from fontenc to lmodern are sufficient):

\documentclass[
    a4paper,
    final
]{scrartcl}

\usepackage[T1]{fontenc} % font encoding
\usepackage[utf8]{inputenc} % input encoding
\usepackage[french]{babel} % keyword translation and hyphenation
\usepackage{lmodern} % lmodern looks better than cm-super

\usepackage[
    babel=true,
    verbose=true
]{microtype}

\usepackage[]{graphicx} % if you want figures
\usepackage[autostyle]{csquotes} % quotes

\usepackage[
    backend=biber,
    style=authoryear-icomp,
    sortlocale=fr_FR,
    natbib=true,
    url=false, 
    doi=true,
    eprint=false
]{biblatex}
\addbibresource{biblatex-examples.bib} % just an example
%\addbibresource{\jobname.bib} % include your own bib file

\usepackage[]{hyperref}
\hypersetup{
    colorlinks=true,
}


%% ===========================
\begin{document}

Lorem ipsum dolor sit amet, consetetur sadipscing elitr, 
sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, 
sed diam voluptua. 
At vero eos et accusam et justo \citet{kastenholz} et ea rebum. 
Stet clita kasd gubergren, 
no sea takimata sanctus est Lorem ipsum dolor sit amet. 

Lorem ipsum dolor sit amet, consetetur sadipscing elitr, 
sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, 
sed diam voluptua. 
At vero eos et accusam et justo duo dolores et ea rebum. 
Stet clita kasd gubergren, 
no sea takimata sanctus est Lorem ipsum dolor sit amet~\citep{sigfridsson}.

\printbibliography 

\end{document}

[Tex/LaTex] How to encode foreign characters so that they are searchable in the resultant PDF file

Can you try:

\usepackage[T1]{fontenc}

Full test document:

\documentclass{article}
\usepackage[utf8]{inputenc}
\usepackage[T1]{fontenc}
\begin{document}
Some foreign characters: öøäéüæåñ
\end{document}
