[Tex/LaTex] Remove hyphen from word spanning two lines in text copied from a pdf file

copy/paste hyphenation pdftex

If I copy text from a PDF and a word is hyphenated and spans two lines, the copied text contains the "-". For example:

Examp-
le

should be copied as

Example

not

Examp-le

The problem is that hyphens that are part of the source text must be preserved:

bug-
proof

must be

bug-proof

in copied form.

How can I achieve this?

I think this question is related to Make ligatures in Linux Libertine copyable (and searchable)

Edit: I am sorry, my initial question was not well phrased. I typeset documents in LaTeX and compile them to PDF with pdfLaTeX (MiKTeX). Is it possible for pdfLaTeX to distinguish between 'line break' and 'interword' hyphens? Does the PDF specification allow for such different hyphens, so that a PDF reader that respects the difference copies the 'interword' hyphens, but not the 'line break' hyphens and the line breaks that go with them?

Best Answer

In the worst case, the PDF renders the hyphens at the ends of lines with the same character as the hyphen that sits between words. Let's call them 'line break' and 'interword' hyphens for now.

That would mean they cannot be told apart automatically: an interword hyphen might coincide with a line break, and that is impossible to detect. In that case, use search & replace (replacing with nothing) to get rid of all of them, then a second search & replace to restore the hyphen in words that are now known to be missing one. Sorry.

In the better case, the actual characters inside the PDF are different, even though they may look the same. Copying and pasting, depending on your PDF reader, tends to lose that distinction, if it was there in the first place. The same issue produces an 'end of line' (EOL) character for every visible line in the PDF, rather than one at the end of each paragraph. LaTeX doesn't mind (it looks for empty lines), but your other text editing needs or tooling might.

Assuming you have been copying and pasting, you might get better results to work with by extracting the text from the PDF automatically. Search for 'PDF to text'; there are a number of options available, from Windows GUI tools, to OS X's built-in PDF handling (look into Automator), to command-line tools for UNIX/Linux/Cygwin environments.

The output would be plain text. Some tools perform or allow for some manipulation of the extracted text, preserving only actual line endings rather than merely the ones shown, etc.
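If the extraction tool keeps every visible line break, the paragraphs can be reflowed afterwards. The following is only a minimal sketch in Python, assuming that paragraphs in the extracted text are separated by blank lines; the function name unwrap_paragraphs is made up for this illustration and is not part of any tool mentioned here.

```python
import re


def unwrap_paragraphs(text):
    """Collapse visual line breaks, keeping blank lines as paragraph breaks.

    Assumption: paragraphs in the extracted text are separated by one or
    more blank lines; every other newline is only a visual line break.
    """
    paragraphs = re.split(r"\n\s*\n", text.strip())
    # Join the lines of each paragraph with single spaces.
    return "\n\n".join(" ".join(p.split()) for p in paragraphs)


if __name__ == "__main__":
    sample = "first line\nsecond line\n\nnext paragraph\ncontinues here\n"
    print(unwrap_paragraphs(sample))
```

This does nothing about hyphens; it only removes the per-line EOL characters.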

For text manipulation per se, the typical command-line tools in a UNIX environment can get the bulk of your issues out of the way. That may or may not be usable advice for you, but I would reach for Vim, sed and a sprinkling of regular expressions, all wrapped in some Bash.
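The two-pass search & replace idea can also be sketched in a few lines of Python instead of sed. This is a rough sketch, not a complete solution: the function name dehyphenate and the keep_hyphen word list are assumptions made for this example, and the list of compounds that must keep their hyphen has to be supplied by hand, because the extracted text alone cannot tell the two kinds of hyphen apart.

```python
import re


def dehyphenate(text, keep_hyphen=frozenset()):
    """Join words that were split by a hyphen at the end of a line.

    keep_hyphen is a hand-maintained set of lowercase compounds
    (for example "bug-proof") whose hyphen belongs to the source text.
    """
    def join(match):
        head, tail = match.group(1), match.group(2)
        compound = head + "-" + tail
        if compound.lower() in keep_hyphen:
            return compound      # real hyphen: keep it, drop the line break
        return head + tail       # line-break hyphen: drop hyphen and break

    # word, hyphen, optional spaces, newline, optional spaces, word
    return re.sub(r"(\w+)-[ \t]*\r?\n[ \t]*(\w+)", join, text)


if __name__ == "__main__":
    print(dehyphenate("Examp-\nle"))                   # Example
    print(dehyphenate("bug-\nproof", {"bug-proof"}))   # bug-proof
```

Only the word parts directly around the line break are checked, so a compound with several hyphens that breaks at its last hyphen would need extra handling.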

Related Solutions

[Tex/LaTex] How to make listings code indentation remain unchanged when copied from PDF

To prevent random spaces when copying the text from a listing, you need to use

\lstset{columns=flexible}

But you will now note that the text is not neatly aligned anymore; to solve this, you need to also use

\lstset{keepspaces=true}

This will not solve your problem with spaces disappearing at the beginning of lines when copying. The following hack will produce visible spaces and then make them invisible by coloring them in the background color:

\makeatletter
\def\lst@outputspace{{\ifx\lst@bkgcolor\empty\color{white}\else\lst@bkgcolor\fi\lst@visiblespace}}
\makeatother

This hack is not perfect, however: the typeset character is really a visible space, not a space (so searching the PDF for "char line" will not work), and some PDF readers (like Preview on the Mac) will copy a visible space. It works in Acrobat Reader, and it is extremely pleasant to be able to quickly copy/paste code without problems (perhaps the problem can be circumvented by writing direct PDF code to declare that the character is a space; I've never had the time to try). It might also not work with all typewriter fonts.

Here's the full code of your example:

\documentclass[12pt,oneside]{memoir}

\usepackage{listings}
\usepackage[T1]{fontenc}
\usepackage{xcolor}
\usepackage{textcomp}

\definecolor{codebg}{HTML}{EEEEEE}
\definecolor{codeframe}{HTML}{CCCCCC}

\lstset{language=Awk}
\lstset{backgroundcolor=\color{codebg}}
\lstset{frame=single}
\lstset{framesep=10pt}
\lstset{rulecolor=\color{codeframe}}
\lstset{upquote=true}
\lstset{basicstyle=\ttfamily}
\lstset{showstringspaces=false}

\lstset{columns=flexible}
\lstset{keepspaces=true}
\makeatletter
\def\lst@outputspace{{\ifx\lst@bkgcolor\empty\color{white}\else\lst@bkgcolor\fi\lst@visiblespace}}
\makeatother

\begin{document}

This code example prints out all users on your system:

\begin{lstlisting}[language=c]
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAX_LINE_LEN 1024

int main() {
    char line[MAX_LINE_LEN];
    FILE *in = fopen("/etc/passwd", "r");
    if (!in) exit(EXIT_FAILURE);

    while (fgets(line, MAX_LINE_LEN, in) != NULL) {
        char *sep = strchr(line, ':');
        if (!sep) exit(EXIT_FAILURE);
        *sep = '\0';
        printf("%s\n", line);
    }
    fclose(in);
    return EXIT_SUCCESS;
}
\end{lstlisting}

\end{document}

[Tex/LaTex] Cannot copy text from the simplest PDF file

Unicode mapping based on font encoding

The packages cmap and mmap add glyph-to-Unicode conversion information to the PDF file, based on the TeX encoding in use. They hook into the font loading mechanism of LaTeX and should be loaded as early as possible, e.g.:

\RequirePackage{mmap}% (\usepackage does not work before \documentclass)
\documentclass{article}

Package mmap is used here, because it has better math support AFAIK.

Unicode mapping based on glyph name

An alternative is a feature of pdfTeX that adds the mapping to Unicode based on the name of the glyph in the font. Therefore it does not work for PK fonts, because they do not contain glyph names.

\pdfgentounicode=1 %    
\input{glyphtounicode}

Caution: Package cmap or mmap cannot be used together with \pdfgentounicode. The result would be a duplicate entry in the font data dictionary, which the PDF specification does not allow:

Note: No two entries in the same dictionary should have the same key. If a key does appear more than once, its value is undefined.

Copy&paste then yields an unpredictable result that depends on the PDF viewer.

Font encoding

Especially if you have accented characters or other special symbols, you should consider using the T1 font encoding. LaTeX's default encoding is OT1, which is only 7-bit (at most 128 glyphs). Accented characters are then constructed from a base letter and a separate accent glyph, which is bad for copy&paste:

\usepackage[T1]{fontenc}

You should have the cm-super font bundle installed, which contains Type 1 versions of the EC fonts. Alternatively, use the more modern Latin Modern fonts, which descend from the CM/EC fonts.

\usepackage[T1]{fontenc}
\usepackage{lmodern}