[TeX/LaTeX] Copying continuous text from PDF file to TXT file

copy/paste line-breaking pdfviewers

My question isn't strictly about LaTeX itself, but about the resulting PDF. To make a long story short: I want to be able to copy text from a PDF into a TXT file as continuous text. I believe this is a common problem for many people.

I'm working on a LaTeX document that is compiled to a PDF file. My text contains many words broken at line ends. Sometimes I need to copy the resulting text from the PDF into a plain text file (*.txt). Unfortunately:

PDFs are designed to mimic a printed page, and they are designed only as an output format, not an input format. A PDF is basically a map containing the exact location of characters (individual letters or punctuation, etc.) or images. In most cases, a PDF does not even store information about where one word ends and another begins, much less things like soft breaks vs. hard breaks for paragraph endings.

Therefore I shouldn't be surprised that when I compile the following text to PDF:

\documentclass{article}
\usepackage{graphicx}

\begin{document}

\title{Introduction to \LaTeX{}}
\author{Author's Name}

\maketitle

\begin{abstract}
The abstract text goes here.
\end{abstract}

\section{Introduction}
This is \LaTeX text that will be copied and pasted. Verylongword. Taumatawhakatangi­hangakoauauotamatea­turipukakapikimaunga­horonukupokaiwhen­uakitanatahu is a hill near Porangahau, south of Waipukurau in southern Hawke's Bay, New Zealand.

\subsection{Subsection Heading Here}
This text comes from Wikipedia.: The name "Taumatawhakatangihangakoauauotamateaturipukakapikimaungahoronukupokaiwhenuakitanatahu" translates roughly as "The summit where Tamatea, the man with the big knees, the slider, climber of mountains, the land-swallower who travelled about, played his nose flute to his loved one".

\end{document}

which results in:

(image: PDF created from the example)

and I copy the whole text to a TXT file, I get:

Introduction to L A TEX
Author’s Name
April 29, 2017
Abstract
The abstract text goes here.
1
Introduction
This is L A TEXtext that will be copied and pasted. Verylongword. Taumatawhakatangi-
hangakoauauotamateaturipukakapikimaungahoronukupokaiwhenuakitanatahu is
a hill near Porangahau, south of Waipukurau in southern Hawke’s Bay, New
Zealand.
1.1
Subsection Heading Here
This text comes from Wikipedia.: The name ”Taumatawhakatangihangakoauauo-
tamateaturipukakapikimaungahoronukupokaiwhenuakitanatahu” translates roughly
as ”The summit where Tamatea, the man with the big knees, the slider, climber
of mountains, the land-swallower who travelled about, played his nose flute to
his loved one”.

The most irritating parts are the word breaks and the line endings that become newlines, so that the copied text is not continuous. Is there any trick that can help me copy text from a PDF into a TXT file as continuous text?

Copying text directly from the TeX source is not an option, since the example above is heavily simplified; a typical TeX source contains macros, formatting commands and so on.

Best Answer

Posting (as community wiki) the comment by Marijn about which the OP said “Thanks @Marijn! You can post your answer, so that I can accept it.”

There is a question about this on Stack Overflow. Of the answers there, the one the OP found to work best is detex (opendetex), which is run on the TeX source file rather than on the PDF file.

Just for completeness, the other options mentioned in the Stack Overflow answers are:

  • catdvi, which is run on the DVI file
  • Converting to HTML (with htlatex / tex4ht / hyperlatex / hevea), then extracting text from the HTML file
  • Pandoc, a versatile converter between many formats
  • LaTeX2RTF to convert to RTF, then extract the text somehow
  • untex

See more at the UK TeX FAQ: Conversion from (La)TeX to plain text
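
If only the text copied from the PDF is at hand, the line breaks and end-of-line hyphens shown in the question can be undone approximately after pasting. The following Python sketch is not one of the tools listed above. The function name `unwrap` is made up for illustration, and the heuristic is an assumption: every hyphen at the end of a line is treated as a word break (so a genuine hyphen that happens to fall at a line end, as in "land-swallower", would be lost), and only blank lines are kept as paragraph boundaries.

```python
import re


def unwrap(text: str) -> str:
    """Join hard-wrapped lines (e.g. pasted from a PDF) into continuous paragraphs.

    Hypothetical helper for illustration; blank lines separate paragraphs.
    """
    paragraphs = re.split(r"\n\s*\n", text.strip())
    joined = []
    for para in paragraphs:
        # Assumption: a hyphen directly before a line break marks a broken word,
        # so the hyphen and the line break are both dropped.
        para = re.sub(r"-\n\s*", "", para)
        # Every remaining line break becomes a single space.
        para = re.sub(r"\s*\n\s*", " ", para)
        joined.append(para)
    return "\n\n".join(joined)


if __name__ == "__main__":
    # Lines taken from the copied output shown in the question.
    sample = (
        "This is L A TEXtext that will be copied and pasted. "
        "Verylongword. Taumatawhakatangi-\n"
        "hangakoauauotamateaturipukakapikimaungahoronukupokaiwhenuakitanatahu is\n"
        "a hill near Porangahau, south of Waipukurau in southern Hawke's Bay, New\n"
        "Zealand."
    )
    print(unwrap(sample))
```

Because the copied text in the question has no blank lines between headings and body text, headings and section numbers would be merged into the surrounding paragraph; running detex on the source avoids that problem.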

Note: This is community wiki so please edit the answer rather than leaving comments, if possible.