[TeX/LaTeX] Remove hyphen from word spanning two lines in text copied from a PDF file


If I copy text from a PDF and a word is hyphenated and spans two lines, the copied text contains the "-". For example:


should be copied as




The problem is that hyphens from the source text must be preserved:


must be


in copied form.

How can I achieve this?

I think this question is related to Make ligatures in Linux Libertine copyable (and searchable)

Edit: I am sorry, my initial question was not well phrased. I typeset documents in LaTeX and compile them to PDF with pdfLaTeX (MiKTeX). Is it possible for pdfLaTeX to distinguish between 'line break' and 'interword' hyphens? Does the PDF specification allow for such different hyphens, so that a PDF reader that respects the difference copies text containing the 'interword' hyphens, but not the 'line break' hyphens and the accompanying line breaks?

Best Answer

In the worst case, the PDF renders the hyphens at the end of a line as the same character as the hyphen that sits between words; let's call them 'line break' and 'interword' hyphens for now.

That would mean they cannot be told apart automatically (an interword hyphen might coincide with a line break, which is impossible to detect). In that case, search & replace (with nothing) to get rid of all of them, then search & replace again for the words that are now known to be missing a hyphen. Sorry.
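As a rough illustration of that two-pass approach, here is a minimal Python sketch. It assumes the copied text still contains the line breaks, that "all of them" means every hyphen sitting directly before a line break, and that you supply the list of words that need their hyphen back yourself. The function name and the `compounds` parameter are made up for this example.

```python
import re

def strip_line_break_hyphens(text, compounds=()):
    """Remove hyphens at line ends, then restore hyphens in known compounds."""
    # Pass 1: "hyphen-\nated" -> "hyphenated". Only a hyphen directly followed
    # by a line break (optionally padded with spaces or tabs) is removed.
    joined = re.sub(r"(\w)-[ \t]*\r?\n[ \t]*(\w)", r"\1\2", text)
    # Pass 2: put the hyphen back into words that are known to need one.
    for word in compounds:
        broken = word.replace("-", "")
        pattern = r"\b" + re.escape(broken) + r"\b"
        joined = re.sub(pattern, lambda match, w=word: w, joined)
    return joined

sample = "a hyphen-\nated word and a well-\nknown state-of-the-art fact"
print(strip_line_break_hyphens(sample, compounds=["well-known"]))
```

Hyphens in the middle of a line, such as those in "state-of-the-art", are left alone. The second pass is case-sensitive and only as good as the word list you give it.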

The better case is that the actual characters inside the PDF are different, even though they might look the same. Copying & pasting, depending on your PDF reader, tends to lose that distinction, if it was there in the first place. The same issue produces an 'end of line' (EOL) character for every visible line in the PDF, rather than one at the end of each paragraph. LaTeX doesn't mind (it looks for empty lines), but your other text editing needs or tooling might.

Assuming you have been copying & pasting, you might get better results to work with by extracting the text from the PDF automatically. Search for 'PDF to text'; there are a number of options available, from Windows GUI tools, to OS X built-in PDF handling (look into Automator), to command-line tools for UNIX/Linux/Cygwin environments.

The output would be plain text. Some tools perform or allow for some manipulation of the extracted text, preserving only actual line endings rather than merely the ones shown, etc.
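If your extraction tool does not do this for you, the unwrapping can be approximated afterwards. The following Python sketch assumes that paragraphs in the extracted text are separated by at least one blank line, which is not guaranteed for every tool. The function name is invented for illustration.

```python
import re

def unwrap_paragraphs(text):
    """Collapse visual line breaks, keeping blank lines as paragraph breaks."""
    # Split on one or more blank (or whitespace-only) lines.
    paragraphs = re.split(r"\n\s*\n", text.strip())
    # Inside a paragraph, every run of whitespace becomes a single space.
    return "\n\n".join(" ".join(p.split()) for p in paragraphs)

extracted = "first line\nsecond line\n\nnext paragraph\ncontinues here\n"
print(unwrap_paragraphs(extracted))
```

Any end-of-line hyphens have to be dealt with before this step, because once the lines are joined they can no longer be told apart from interword hyphens.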

For text manipulation per se, the typical command-line tools in a UNIX environment should get the bulk of your issues out of the way. That may or may not be usable advice for you, but I would reach for Vim, sed and a sprinkling of regular expressions, all wrapped in some Bash.