I am pleased to be able to teach Martin Scharrer something he didn't know :)
A fully expandable sanitizer
The following is an implementation of a \Sanitize command that:
Completely removes all control sequences and balanced braces in its argument.
Does not choke on nested braces.
Keeps spaces where they were requested either by " " or "\ " (for after macros).
Is fully expandable (i.e. can be put in \edef or \csname).
Edit: This is a revised version. My initial code had a few minor bugs that were a major pain to fix, and this is substantially rewritten. I think it's clearer, too.
How it works
There are three states: sanitizing spaces, sanitizing groups, and sanitizing tokens. We scan for "words" one at a time, then within each "word" look for groups that might be hiding spaces (TeX's macro scanner will only absorb delimited arguments with balanced braces). Finally, once we are satisfied that we are looking at genuinely contiguous tokens, we scan one at a time and throw out the ones that are control sequences, leaving only explicitly specified spaces (" " or "\ ").
From the inside out, the operation looks like this:
\SanitizeTokens is a big nested conditional that tests its argument against the various special cases. During the sweep for spaces, all space characters were converted to \SanitizedSpace tokens, and they are now converted to \RealSpace tokens. Both \SanitizedSpace and \SanitizeStop are macros that expand to themselves, and since they are private, this means that testing against them via \ifx is a reliable way to detect the exact control sequences (in the first version, these were \countdef tokens, which have the same property but are not quite as private).
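To illustrate the self-expanding markers, here is a minimal sketch. The names \DemoStop and \OtherStop are made up for this example and are not part of the sanitizer. Such a macro must never be expanded in a typesetting context, since it would loop forever; it is only ever compared with \ifx, which looks at meanings without expanding anything.

```latex
\documentclass{article}
\begin{document}
% \DemoStop is a hypothetical marker that expands to itself.
\def\DemoStop{\DemoStop}
% A second hypothetical marker with a different replacement text.
\def\OtherStop{\OtherStop}

% \ifx compares the two meanings and does not expand either token.
\ifx\DemoStop\DemoStop same\else different\fi, % typesets "same"
\ifx\DemoStop\OtherStop same\else different\fi % typesets "different"
\end{document}
```

Because the replacement texts differ, no other macro compares equal to the marker unless it was deliberately defined with the same meaning.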
\SanitizeGroups uses the tricky \def\SanitizeGroups#1#{ construction discussed in this question: Macros with # as the last parameter. It is the most legitimate such use I can imagine: its point is to detect groups, which you can't do using plain macro expansion in any other way. It guarantees that #1 has no groups in it, and since this comes after space elimination, it also has no spaces in it, so we can run \SanitizeTokens straight away. We then "enter" the group and go back to eliminating spaces.
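As a rough illustration of the #1#{ parameter text on its own, consider the following sketch. The macros \UpToBrace and \ShowGroup are hypothetical names used only here; \detokenize assumes an e-TeX based engine, which every current LaTeX uses.

```latex
\documentclass{article}
\begin{document}
% "#1#" makes #1 end at the next explicit opening brace.
% TeX puts that brace back at the end of the replacement text.
\def\UpToBrace#1#{[\detokenize{#1}]\ShowGroup}
% \ShowGroup then picks up the following group as an ordinary argument.
\newcommand\ShowGroup[1]{(\detokenize{#1})}

\UpToBrace abc{def}ghi % typesets "[abc](def)ghi"
\end{document}
```

This mirrors the shape of \SanitizeGroups, where the tokens before the brace are handled first and a second macro then deals with the group.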
\SanitizeSpaces uses pattern matching to grab the first chunk of text up until a space, excluding of course those spaces that are in groups. There is a technical trick here: every use of this macro has {} right after it, before the text. The point of that is so that the argument scanner doesn't remove braces around a group constituting an entire "word" between spaces. If that happens, then we will erroneously treat it as though it's been cleared of spaces when, in fact, it has not. (Any unsanitized spaces would be eaten by \SanitizeTokens because argument scanning ignores spaces.)
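The brace-stripping behaviour that the {} guards against can be seen in isolation in this small sketch. \FirstWord is a hypothetical macro, not part of the sanitizer, and \ttfamily is only there so that the detokenized braces print as braces.

```latex
\documentclass{article}
\begin{document}
\ttfamily
% \FirstWord is a hypothetical macro whose argument is delimited by a space.
\def\FirstWord#1 {[\detokenize{#1}]}

% The whole argument is a single brace group, so TeX strips the outer braces.
\FirstWord{a b} % typesets "[a b]"

% With {} in front, the argument is no longer a single group and the braces survive.
\FirstWord{}{a b} % typesets "[{}{a b}]"
\end{document}
```

In the first call the inner space is exposed without ever having been seen by the space scanner, which is the failure mode described above.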
There are of course some cute utility macros. My favorite is \IfNoGapToStop, which is called like this: \IfNoGapToStop.X. \SanitizeStop, with X being the quantity potentially containing a gap. If it has none, then the first gap is the visible space after the second period; if it has a gap, then the two periods are in different components, and both arguments of \IfNoGapToStop are nonempty.
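The way the two periods end up in the same or in different arguments can be watched with a stripped-down sketch. \SplitAtGap and \DemoStop are hypothetical stand-ins with the same parameter text as \IfNoGapToStop; they only print what each argument received.

```latex
\documentclass{article}
\begin{document}
\ttfamily
% Hypothetical self-expanding end marker, only used as a delimiter here.
\def\DemoStop{\DemoStop}
% Same parameter text as \IfNoGapToStop, but the arguments are just displayed.
\def\SplitAtGap#1 #2\DemoStop{[\detokenize{#1}][\detokenize{#2}]}

% No gap in "ab": both periods land in #1 and #2 is empty.
\SplitAtGap.ab. \DemoStop % typesets "[.ab.][]"

% A gap in "a b": the periods are split and #2 is nonempty.
\SplitAtGap.a b. \DemoStop % typesets "[.a][b. ]"
\end{document}
```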
Aside from the structural changes from the previous version, this one correctly preserves spaces at the boundaries of groups. (That version didn't explicitly scan for groups, but eliminated them as a side effect of absorbing tokens. That works, but it also makes it impossible to be sure when you are looking at a group, which may have spaces, rather than a single token.)
Oh, and of course: the algorithm is no longer stupid. The last version rescanned the entire initial portion of the text repeatedly while looking for words (the point of that was so as not to "lose" those tokens before sanitizing them). Now I crawl through the words one at a time, so there's no problem with abandoning each one when looking for the next. That turns a quadratic algorithm into a linear one.
This is not my preferred way of writing TeX anymore (for that, you should read this answer: How to write readable commands) but pgfkeys is really not the tool for this kind of textual parsing.
\documentclass{article}
\makeatletter
\newcommand\Sanitize[1]{%
\SanitizeSpaces{}#1 \SanitizeStop
}
% This loops through and replaces all spaces (outside brace groups) with \SanitizedSpace's.
% Then it goes for the control sequences.
% All calls to this should put a {} right before the content, to inhibit the gobbling of braces
% if there is a group right at the beginning.
\def\SanitizeSpaces#1 #2\SanitizeStop{%
\IfEmpty{#2}% Last word
{\IfEmpty{#1}% No content at all
{}% Nothing to do
{\SanitizeGroups#1{\SanitizeStop}}%
}%
% No need for a trailing space anymore: there's already one from the initial call
{\SanitizeGroups#1\SanitizedSpace{\SanitizeStop}\SanitizeSpaces{}#2\SanitizeStop}%
}
% Sanitize tokens up to the next group, then go back to doing spaces.
\def\SanitizeGroups#1#{%
\SanitizeTokens#1\SanitizeStop
\EnterGroup
}
% Sanitize the next group from the top.
\newcommand\EnterGroup[1]{%
\ifx\SanitizeStop#1%
\expandafter\@gobble
\else
\expandafter\@firstofone
\fi
{\SanitizeSpaces{}#1 \SanitizeStop\SanitizeGroups}%
}
\newcommand\SanitizeTokens[1]{%
\ifx\SanitizeStop#1%
\else
\ifx\SanitizedSpace#1%
\RealSpace
\else
\ifx\ #1%
\RealSpace
\else
\if\relax\noexpand#1%
\else
#1%
\fi
\fi
\fi
\expandafter\SanitizeTokens
\fi
}
% We use TeX's proclivity to eat braces even for delimited arguments to eat the braces if #1
% happens to be just {}, which we put in.
% Even if we didn't put it in, {} is going to get thrown out when \SanitizeSpaces gets to it.
\newcommand\IfEmpty[1]{%
\IfOneTokenToStop.#1\SanitizeStop
{% #1 has at most space tokens
% and thus is nonempty if and only if there is a gap:
\IfNoGapToStop.#1. \SanitizeStop
}
{% #1 has non-space tokens
\@secondoftwo
}%
}
% Checks for a gap in #1, meaning #2 is nonempty
% This should only be used with \IfEmpty
\def\IfNoGapToStop#1 #2\SanitizeStop{%
% It's enough to check for one token, since #2 is never just spaces
\IfOneTokenToStop.#2\SanitizeStop
}
\def\IfOneTokenToStop#1#2{% From \IfEmpty, #1 is always a .
\ifx\SanitizeStop#2%
% If #2 is multi-token, the rest of it will fall in the one-token case and be passed over.
% If not, well, that's what we asked for.
\expandafter\@firstoftwo
\else
\expandafter\GobbleToStopAndSecond
\fi
}
\def\GobbleToStopAndSecond#1\SanitizeStop{%
\@secondoftwo
}
\makeatother
\def\SanitizeStop{\SanitizeStop}
\def\SanitizedSpace{\SanitizedSpace}
\def\RealSpace{ }
\begin{document}
\setlength\parindent{0pt}\ttfamily
% Torture test
\edef\a{%
\Sanitize{ Word1 \macro{Word2 Word3}{\macro\ Word4}{ Word5} {Word6 }{}Word7{ }{{Word8}} }
}\meaning\a
\a
\medskip
% Examples
\edef\a{%
\Sanitize{\emph{This} sentence has \TeX\ macros and {grouping}. }
}\meaning\a
\a
\medskip
\edef\a{%
\Sanitize{{A}{ gratuitously {nested} sentence {}{{with many} layers}}.}
}\meaning\a
\a
\medskip
\end{document}
While I adore the power of the \tikzmark concept, too, it seems (with the necessity to compile twice) to be overkill for this situation. Why not just box the content and measure its size?
The following implements this idea based on some code I originally developed for this answer to a question about highlighting elements in a lstlisting environment while also keeping the syntax highlighting. The result is the \btHL command, which works like a font-changing command (such as \color or \bfseries) in that it affects everything until the end of the group (not across line breaks); this was a requirement for playing together with listings. The basic idea is to box the content and then typeset it inside a TikZ node. The bounding box of the tikzpicture, however, is adjusted to the size of the content, so that the highlighting does not take extra space (to prevent "jumping content" if used with beamer overlays).
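The "box and measure" step by itself can be sketched with standard LaTeX box commands. The box name \demobox is an assumption made for this illustration; the answer's own code uses \@tempboxa inside an lrbox environment instead.

```latex
\documentclass{article}
\begin{document}
% \demobox is a hypothetical box register used only in this sketch.
\newsavebox\demobox
% Typeset the content once into the box; nothing is printed yet.
\sbox\demobox{a text to be highlighted}

% The natural dimensions are now available through \wd, \ht and \dp.
Width: \the\wd\demobox, height: \the\ht\demobox, depth: \the\dp\demobox.

% The stored box can then be placed wherever it is needed.
\fbox{\usebox\demobox}
\end{document}
```

No second compilation is needed, since the dimensions are known as soon as the box has been set.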
As a quick solution, I have implemented your \tikzhighlight macro on this base; the code, however, could be simplified quite a bit if the content to highlight is always given as a macro parameter.
\documentclass{article}
\usepackage[T1]{fontenc}
\usepackage{tikz}
\usepackage{amsmath}
\makeatletter
\newenvironment{btHighlight}[1][]
{\begingroup\tikzset{bt@Highlight@par/.style={#1}}\begin{lrbox}{\@tempboxa}}
{\end{lrbox}\bt@HL@box[bt@Highlight@par]{\@tempboxa}\endgroup}
\newcommand\btHL[1][]{%
\begin{btHighlight}[#1]\bgroup\aftergroup\bt@HL@endenv%
}
\def\bt@HL@endenv{%
\end{btHighlight}%
\egroup
}
\newcommand{\bt@HL@box}[2][]{%
\tikz[#1]{%
\pgfpathrectangle{\pgfpoint{0pt}{0pt}}{\pgfpoint{\wd #2}{\ht #2}}%
\pgfusepath{use as bounding box}%
\node[anchor=base west, fill=orange!30,outer sep=0pt,inner xsep=0.2em, inner ysep=0.1em, #1]{\usebox{#2}};
}%
}
\makeatother
\newcommand{\tikzhighlight}[2][red]{%
{\btHL[fill=#1!10,draw=#1,rounded corners]#2}%
}
\begin{document}
\begin{itemize}
\item this is \tikzhighlight[yellow]{a text to be highlighted}
\item {\tiny{this is a text to be \tikzhighlight{highlighted}}}
\item \huge{this is a text to be \tikzhighlight{highlighted}}
\end{itemize}
\begin{align*}
&\tikzhighlight[green]{\ensuremath{x+\dfrac{z}{y}}}=100\\
&x+\tikzhighlight[blue]{\ensuremath{y}}=100
\end{align*}
\end{document}
Some additional fine tuning could be applied to the dimensioning of the boxes and the bounding box.
The mighty \the
TeX has many registers and internal parameters, whose list can be found in the TeXbook (supplemented by the e-TeX manual and the pdftex manual, for the respective extensions; many more internal parameters are introduced by XeTeX and LuaTeX).
In general, \the\something extracts a representation of the value assigned to \something; the assignment can be explicit (\dimen100=2cm) or implicit (\year is assigned its value at the beginning of the TeX run); in some cases the parameter is "read only" (\badness) and the value stored in it has been assigned during processing.
Registers
In what follows, \something stands for a register of the analyzed type; for example, after \newcount\pippo one can say \the\pippo, or it's an explicitly mentioned register such as \count100 or \dimen0.
1. \count: \the\something extracts the counter's value representation as a number in base 10
2. \dimen: \the\something extracts the stored length representation with unit "typographic point" (pt) as a decimal number, always with at least a digit after the decimal point; for example, after \dimen0=2pt, \the\dimen0 will produce 2.0pt
3. \skip: almost the same as before, but with the additional plus and minus parts (which are omitted if zero)
4. \muskip: the same as \skip, but the units are in mu
5. \toks: this is a very special case, see later
6. \box: this is also a special case, as \the\box0 is illegal (the contents of box registers are accessed with \box, \copy and related commands such as \unhbox)
Internal parameters
One can use \the\something where \something is an internal parameter; for example, \the\tolerance will behave just like case 1, \the\parindent like in case 2, \the\baselineskip as in case 3, \the\thinmuskip as in case 4, \the\everypar as in case 5. Similarly, \the\day, \the\month, \the\year and \the\time will print the values these internal parameters have been automatically assigned at the start of the job (or modified afterwards). Note that \time is assigned the number of minutes past midnight when the job started (as determined by asking the operating system).
Internal tables
TeX maintains some tables (or vectors): the \catcode table for category codes; the \uccode and \lccode tables for uppercase-lowercase conversion; the \sfcode table for space factors; the \mathcode table for deciding the nature of a character in math mode; the \delcode table for deciding what to do if a character is encountered when TeX is looking for a math fence. In all these cases, \the\catcode<number> (and similarly for the other tables) produces the stored value in place <number> of the vector; the vector's length is 256 in the case of (pdf)TeX, 2^21 in the case of XeTeX and LuaTeX. For example, \the\catcode123 and \the\lccode65 will produce respectively 1 and 97 (with standard settings); of course, the <number> can be expressed in other ways (as octal or hexadecimal number, or as an alphabetical constant). For (pdf)TeX the <number> must be from 0 to 255; for XeTeX and LuaTeX from 0 to 2097151 (hexadecimal 0x1FFFFF).
Special uses
\the can also go before other tokens.
If we've said \chardef\pippo=37, then \the\pippo will produce 37; similarly for \mathchardef. The representation will be a number in base 10.
\the\font produces a control sequence that corresponds to the command for selecting the current font; therefore, \xdef\pippo{\the\font} will globally define the command \pippo that will select the font current at the time of \xdef.
\the\hyphenchar\font, \the\skewchar\font, \the\fontdimen<number>\font will extract the corresponding information for the current font; instead of \font one can use any font selecting command (such as \tenrm in Plain TeX or \OT1/cmr/m/n/10 in LaTeX). For example, \the\fontdimen2 followed by one of these commands will make available the normal interword space for the mentioned font (which is 3.33333pt).
\the<token register> will produce (a copy of) the token list contained in the <token register>; internal token variables can also be used: \the\everypar will produce the token list contained in the stated variable.
Only in the cases of \the\font (where any font selecting command can go instead of \font) and \the<token register> or \the<internal token variable> does TeX produce something which is not a string of characters.
When \the produces a string of characters, they will all have category code 12, except spaces, which receive category code 10.
Important notes
\the is expandable. So, while \def\pippo{\count100} and \edef\pippo{\count100} are completely equivalent, \edef\pippo{\the\count100} will define \pippo as the current register's value. If we want to store away the current chapter number in LaTeX, we say \edef\thischapternumber{\the\value{chapter}}.
\the will perform expansion on the token following it, stopping only when the next token is a legal one which \the can be applied to. So \the\value{chapter} is possible, as \value{chapter} expands to \csname c@chapter\endcsname and then to \c@chapter (that is a count register defined via \countdef).
\the is sometimes superfluous. For example, if we want to keep the current category code of @, in order to restore it after some processing, we might store the result of \the\catcode`\@ in a macro, but there's a more efficient way that avoids \the altogether, because \catcode`\@ can be used directly wherever TeX expects a number.
Similarly, if we want to take different actions when the badness of the last produced box is less than 5000 or greater than 5000, we can compare \badness with 5000 directly in an \ifnum test, with no \the in front.
Similarly, to set \parindent to the value stored in \normalparindent (a \dimen register allocated in advance), we don't say \parindent=\the\normalparindent but use the easier \parindent=\normalparindent.
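The three situations in the last note might be sketched as follows. This is an illustration under assumptions, not the code of the original answer: the macro name \savedatcatcode, the allocation of \normalparindent and the 5cm test box are all made up for the example.

```latex
\documentclass{article}
% Hypothetical register, allocated in advance and given a value.
\newdimen\normalparindent
\normalparindent=\parindent

% Saving a category code with \the: this works, but goes through a macro.
\edef\savedatcatcode{\the\catcode`\@}
% Without \the: \catcode`\@ is already a number, so \chardef can store it.
\chardef\savedatcatcode=\catcode`\@
\catcode`\@=11 % some processing that treats @ as a letter
\catcode`\@=\savedatcatcode % restore the saved category code

\begin{document}
% \badness is read directly by \ifnum; no \the is needed.
\setbox0=\hbox to 5cm{a short line}% deliberately underfull
\ifnum\badness<5000 acceptable\else too bad\fi

% A dimen register can be assigned from another one directly.
\parindent=\normalparindent
\end{document}
```

In each case TeX is already looking for a number or a dimension, so it reads the internal quantity itself and the conversion to characters performed by \the would only be undone again.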