PDFTeX Highlighting – How to Highlight Every Occurrence of a List of Words

Tags: formatting, highlighting, keywords, luatex, pdftex

In order to revise a draft, and identify related sections, I would like to identify similar words (by color of text, highlight, underline, or otherwise) according to topic.

For example, I would like all uses of the terms "foo" or "bar" highlighted red and all uses of "biz" and "baz" highlighted green.

There might be four or five groups of words or word roots that I want to specify. This is only for revision, so it can be rather crude.

For example, replace this:

[image: sample text without highlighting]

with this:

[image: the same text with the keywords colored]

(In the example, it is hard to see the green text; perhaps bold+color or underline would be more useful)

Update: A related question provides an answer using XeLaTeX. My document does not compile with XeLaTeX, so I would prefer a solution compatible with pdflatex if one is available (since that is what I use), though my document also compiles with LuaTeX.

Best Answer

Here is a solution using LuaTeX callbacks. It also uses the luacolor.lua library from the luacolor package.

First, the package luahighlight.sty:

\ProvidesPackage{luahighlight}
%\RequirePackage{luacolor}
\@ifpackageloaded{xcolor}{}{\RequirePackage{xcolor}}
\RequirePackage{luatexbase}
\RequirePackage{luacode}
\newluatexattribute\luahighlight
\begin{luacode*}
highlight = require "highlight"
luatexbase.add_to_callback("pre_linebreak_filter", highlight.callback, "highlight")
\end{luacode*}

\newcommand\highlight[2][red]{%
  \bgroup
  \color{#1}%
  \luaexec{highlight.add_word("\luatexluaescapestring{\current@color}","\luatexluaescapestring{#2}")}%
  \egroup
}

% save default document color
\luaexec{highlight.default_color("\luatexluaescapestring{\current@color}")}

% stolen from luacolor.sty
\def\luacolorProcessBox#1{%
  \luaexec{%
    oberdiek.luacolor.process(\number#1)%
  }%
}

% process a page box
\RequirePackage{atbegshi}[2011/01/30]
\AtBeginShipout{%
  \luacolorProcessBox\AtBeginShipoutBox
}
\endinput

The package provides the command \highlight, which takes one required and one optional argument: the required argument is the word to highlight, and the optional argument is the color (red by default). In the pre_linebreak_filter callback, words are collected, and when a word matches, color information is inserted.
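Applied to the word groups from the question, the preamble could look like the sketch below. It assumes luahighlight.sty and highlight.lua are placed next to the document and that it is compiled with lualatex; the sample sentence is made up for illustration. Since the lookup is an exact match on the lowercased word, each word form has to be registered separately, so word roots are not covered by this sketch.

```latex
\documentclass{article}

\usepackage[pdftex]{xcolor} % same option as in the answer's own example
\usepackage{luahighlight}

% first topic: red
\highlight[red]{foo}
\highlight[red]{bar}
% second topic: green
\highlight[green]{biz}
\highlight[green]{baz}

\begin{document}
Foo and bar belong to the first topic; biz and baz belong to the second.
\end{document}
```

Each \highlight call registers exactly one word, so a group of several words needs one call per word with the same color.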

Lua module, highlight.lua:

local M = {}

require "luacolor"

local words = {}
local chars = {}

-- get attribute allocation number and register it in luacolor
local attribute = luatexbase.attributes.luahighlight
-- local attribute = oberdiek.luacolor.getattribute
oberdiek.luacolor.setattribute(attribute)


-- make local version of luacolor.get

local get_color = oberdiek.luacolor.getvalue

-- we must save default color
local default_color 

function M.default_color(color)
  default_color = get_color(color)
end

local utflower = unicode.utf8.lower
function M.add_word(color,w)
  local w = utflower(w)
  words[w] = color
end

local utfchar = unicode.utf8.char

-- we don't want to include punctuation
local stop = {}
for _, x in ipairs {".",",","!","“","”","?"} do stop[x] = true end


-- look up node type ids by name; the numeric values (37 and 10 in the
-- LuaTeX release this was written for) differ between LuaTeX versions
local glyph_id = node.id("glyph")
local glue_id = node.id("glue")

function M.callback(head)
  local curr_text = {}   -- lowercased characters of the word being collected
  local curr_nodes = {}  -- glyph nodes that make up that word
  for n in node.traverse(head) do
    if n.id == glyph_id then
      local char = utfchar(n.char)
      -- exclude punctuation
      if not stop[char] then
        local lchar = chars[char] or utflower(char)
        chars[char] = lchar
        curr_text[#curr_text+1] = lchar
        curr_nodes[#curr_nodes+1] = n
      end
      -- set default color
      local current_color = node.has_attribute(n,attribute) or default_color
      node.set_attribute(n, attribute,current_color)
    elseif n.id == glue_id then
      -- glue ends the current word: look it up and color its glyphs on a match
      local word = table.concat(curr_text)
      curr_text = {}
      local color = words[word]
      if color then
        local colornumber = get_color(color)
        for _, x in ipairs(curr_nodes) do
          node.set_attribute(x,attribute,colornumber)
        end
      end
      curr_nodes = {}
    end
  end
  return head
end


return M

We use the pre_linebreak_filter callback to traverse the node list. We collect the glyph nodes (id 37) in a table, and when we find a glue node (id 10, mainly spaces), we construct a word from the collected glyphs. Some characters, such as punctuation, are excluded and stripped out. All characters are lowercased, so words are also detected at the beginning of sentences, etc. Note that these numeric ids depend on the LuaTeX version; node.id("glyph") and node.id("glue") return the values for the version that is running.

When a word is matched, we set the attribute field of its glyphs to the value under which the corresponding color is stored in the luacolor library. Attributes are a new concept in LuaTeX: they make it possible to store information in nodes, which can be processed later. That is what happens here: at shipout time, all pages are processed by the luacolor library, and nodes are colored according to their luahighlight attribute.

\documentclass{article}

\usepackage[pdftex]{xcolor}
\usepackage{luahighlight}
\usepackage{lipsum}

\highlight[red]{Lorem}
\highlight[green]{dolor}
\highlight[orange]{world}
\highlight[blue]{Curabitur}
\highlight[brown]{elit}
\begin{document}

\def\world{earth}
\section{Hello world}

Hello world, world? world! \textcolor{purple}{but normal colors works} too\footnote{And also footnotes, for instance. World WORLD wOrld}. Hello \world.

\lipsum[1-12]
\end{document}
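The attribute mechanism that the answer relies on can be tried in isolation. The following is a minimal sketch, not part of the original answer: the attribute name \demoattr, the value 7, and the callback label are made up for illustration, and it assumes that luatexbase provides \setluatexattribute alongside the \newluatexattribute used above. It marks one word with an attribute and reads the value back from the glyph nodes in pre_linebreak_filter.

```latex
\documentclass{article}
\usepackage{luatexbase}
\usepackage{luacode}

% hypothetical attribute, allocated the same way as \luahighlight above
\newluatexattribute\demoattr

\begin{luacode*}
local attr = luatexbase.attributes.demoattr
local glyph_id = node.id("glyph")

-- write a log line for every glyph on which the attribute is set
local function report(head)
  for n in node.traverse_id(glyph_id, head) do
    local value = node.has_attribute(n, attr)
    if value then
      texio.write_nl("glyph " .. n.char .. " carries attribute value " .. value)
    end
  end
  return head
end

luatexbase.add_to_callback("pre_linebreak_filter", report, "demo attribute report")
\end{luacode*}

\begin{document}
% the attribute setting is local to the group, like a font change
plain {\setluatexattribute\demoattr{7}marked} plain
\end{document}
```

Compiled with lualatex, this should write log lines only for the glyphs of "marked", which is the same kind of per-node information that luahighlight stores as a color number and luacolor reads at shipout.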
