[TeX/LaTeX] How to fix missing or incorrect mappings from glyphtounicode.tex

Tags: characters, copy/paste, pdftex, symbols, unicode

glyphtounicode.tex has been described as the best solution for generating copy-and-pasteable symbols. However, I find that various symbols that I need to use do not paste as the appropriate Unicode codepoints/characters. How can I fix this?

This is how my macros should be pasting:

  • \nSubset: ⋐̸ (U+22D0 U+0338)
  • \cong: ≅ (U+2245)
  • \ncong: ≇ (U+2247)
  • \bigcup: ⋃ (U+22C3)
  • \notin: ∉ (U+2209)
  • \neq: ≠ (U+2260)
  • \llbracket: ⟦ (U+27E6); \rrbracket: ⟧ (U+27E7)
  • \llparenthesis: (|; \rrparenthesis: |)
    • Update: Unicode offers the symbol pairs ⦇ ⦈ (U+2987/U+2988) and ⦅ ⦆ (U+2985/U+2986), which many might consider a better choice for these macros.
  • \coloneqq: ≔ (U+2254)
  • \models: ⊧ (U+22A7) [currently |=]
  • \Rsh: ↱ (U+21B1)
  • \textlengthmark: ː (U+02D0) [currently :]
  • \blackdiamond: ⬩ (U+2B29)
  • \sqbullet: ▪ (U+25AA)
  • \square: ▫ (U+25AB)

(Note: Earlier I erroneously stated that \neg pastes incorrectly as ￢ (U+FFE2) instead of ¬ (U+00AC). That was wrong: \neg pastes correctly, and it is Word that substitutes the other symbol. Similarly, I noticed that Word does not paste all accented letters correctly from PDF files, whereas they paste exactly right into Notepad. I don't know whether this is truly a Word issue (if so, it's likely a legacy encoding/font hack), or whether it has to do with pdflatex or the Unicode/non-Unicode clipboard distinction in Windows. Anyone feel free to add (non-ranty) insight into this.)

This is how they currently paste:

>; ; ;
S
; <; ,; J; K; L; M;B; |=;é; :; ˛; ‚; ˝

(The linebreaks before and after \bigcup ("S") are probably caused by it being a big operator, so they're nothing to worry about.)

Here is minimal example code:

\documentclass{article}
\input glyphtounicode % I am using the updated version from http://www.lcdf.org/type/ (lcdf-typetools-2.94.tar.gz).
  \pdfgentounicode=1
\usepackage[T3,T1]{fontenc} % The T3-encoding is required by the tipa-package.
\usepackage[noenc,safe]{tipa} % \textlengthmark
\usepackage{txfonts}
\usepackage[only,llbracket,rrbracket,llparenthesis,rrparenthesis]{stmaryrd}
\usepackage[mathb]{mathabx}

\begin{document}

\noindent \( \nSubset; \cong; \ncong; \bigcup; \notin; \neq; \llbracket; \rrbracket; \llparenthesis; \rrparenthesis; \coloneqq; \models; \Rsh; \mbox{\textlengthmark}; \blackdiamond; \sqbullet; \square\) \\

\noindent \( \nexists \)

\end{document}

The symbol ∄ (\nexists) at the end has been included to demonstrate that glyphtounicode.tex is compatible with my code, because it pastes correctly (and in fact requires a recent version of glyphtounicode.tex, see the comment in my code).

Best Answer

You can add your own definitions. For example, here is how to make an "a" copy as "A":

\documentclass[a4paper,12pt]{article}

\usepackage[ansinew]{inputenc}
\usepackage[T1]{fontenc}
\input{glyphtounicode}

\pdfglyphtounicode{a}{0041} %0041=A
\pdfgentounicode=1
\begin{document}
aaaaa 
\end{document}

The main problem is, of course, finding the names of the glyphs you are using. If you know the font, you can look the names up in its .afm or .pfb file. Alternatively, add \pdfcompresslevel=0 to your document and inspect the resulting PDF in a text editor: look for lines starting with /CharSet (there will be more than one if you use more than one font). For example, if I add \int to the example above, I find /CharSet (/integraltext), so integraltext is the name of the glyph.
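
The lookup-then-map workflow described above could be sketched as follows. This is only an illustration built on the \int case: the glyph name integraltext is the one reported above, U+222B (INTEGRAL) is my assumption for the intended target, and glyphtounicode.tex may well contain this mapping already. The same two steps should apply to any glyph whose name you find in /CharSet.

```latex
\documentclass{article}
\input{glyphtounicode}
\pdfgentounicode=1

% Step 1: leave PDF streams uncompressed, so that the /CharSet entries
% can be read by opening the PDF in a text editor.
\pdfcompresslevel=0

% Step 2: once the glyph name is known (here: integraltext), map it to
% the desired code point. A definition placed after \input{glyphtounicode}
% overrides the one loaded from that file.
\pdfglyphtounicode{integraltext}{222B} % 222B = INTEGRAL (assumed target)

\begin{document}
\( \int f \)
\end{document}
```

A glyph can also be mapped to a sequence of code points by separating them with spaces, as glyphtounicode.tex itself does for the ff ligature with \pdfglyphtounicode{ff}{0066 0066}. That is presumably how a combination such as U+22D0 U+0338 would be written once the glyph's name has been found.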

If the symbol is not a single glyph, or if its name is not unique or changes from one font family to the next, you will probably need the accsupp package instead; see the question "Is it possible to provide alternative text to use when copying text from the PDF?".
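
A minimal sketch of the accsupp route, applied to \llparenthesis and \rrparenthesis from the question, might look like this. The helper name \copyas is hypothetical (it is not part of accsupp), and the targets U+2987/U+2988 are the code points suggested in the question's update. Whether the replacement text is actually used on copy depends on the PDF viewer honouring /ActualText, so treat this as something to test rather than a guaranteed fix.

```latex
\documentclass{article}
\usepackage{accsupp}
\usepackage[only,llparenthesis,rrparenthesis]{stmaryrd}

% \copyas{<hex code points>}{<material>}: hypothetical helper that attaches
% the given Unicode text (UTF-16 hex digits) to the typeset material as
% /ActualText.
\newcommand*{\copyas}[2]{%
  \BeginAccSupp{method=hex,unicode,ActualText=#1}%
  #2%
  \EndAccSupp{}%
}

\begin{document}
\( \copyas{2987}{\llparenthesis} x \copyas{2988}{\rrparenthesis} \)
\end{document}
```

Because the replacement text is attached to the marked content rather than to a glyph name, this approach does not depend on the font, which is why it suits symbols that are assembled from several glyphs.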
