[TeX/LaTeX] Placing the un-ligatured text in the OCR layer

ligatures pdf

I was reading Why can't "fi" be separated when being copied from a compiled pdf? and had a thought:

I know that PDF documents of version 1.4 and up have an OCR layer. Would it be possible to have pdfTeX or LuaTeX place the un-ligatured text into the OCR layer, so that you don't get inconsistent or odd behaviour when copying ligatures?

It also occurs to me that this could be a workaround for my troubles copying math-mode Greek characters, discussed in Proper way to use greek letters in an English document. It also seems like an ideal way to provide more accessible code for equations, as trying to copy them right now is not exactly useful.

Best Answer

Two possible tools you can use to achieve what you are after:

The first is calibre-ebook, which can do PDF-to-PDF conversions and also handles a tonne of other formats.

The second is Heiko Oberdiek's experimental package accsupp. It lets you attach alternative text (such as ActualText) to characters and spans of text for accessibility purposes, and that will achieve what you want as well. At the end of the day I think it is high time we stop using ligatures. They don't offer much in terms of improving typography on a screen.

The example below is from the package. You will need to adapt it to suit.

\documentclass{article}
\usepackage[unicode]{hyperref}
\usepackage{accsupp}[2007/11/14]
\begin{document}
  \begin{equation}
    \BeginAccSupp{
      method=pdfstringdef,
      unicode,
      ActualText={%
        a\texttwosuperior +b\texttwosuperior
        =c\texttwosuperior
      }
    }
    a^2 + b^2 = c^2
    \EndAccSupp{}
  \end{equation}
\end{document} 
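The same mechanism can be pointed at a ligature rather than an equation. What follows is only a minimal sketch, assuming pdfLaTeX output: the macro name \ligtext is made up for this illustration and is not part of accsupp, and whether the replacement text is actually used on copy and paste depends on the PDF viewer.

```latex
\documentclass{article}
\usepackage{accsupp}

% Hypothetical helper (not provided by accsupp): typesets #1 as usual,
% ligature included, and attaches the same letters as ActualText.
% method=escape is used so that hyperref is not required.
\newcommand*{\ligtext}[1]{%
  \BeginAccSupp{method=escape,ActualText={#1}}%
  #1%
  \EndAccSupp{}%
}

\begin{document}
An o\ligtext{ffi}ce with a \ligtext{fl}oor plan and a \ligtext{fi}le.
\end{document}
```

Wrapping every ligature by hand like this is tedious, so treat it as a way to test how your viewer treats ActualText rather than as a finished solution.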

Your suggestion to map it to the OCR layer is not an option unless you scan the PDF and OCR it. To summarize: start with calibre-ebook and, if it cannot offer you what you want, explore the other options.
