[Tex/LaTex] Dynamic calculation within regular expression quantifier in LaTeX3’s l3regex package

calculations l3regex latex3

I am trying to do some simple calculations to be applied to a regular expression quantifier used in l3regex's \regex_replace_all:nnN function. I've built my code around what I found here: Defining a find and replace algorithm using LaTeX3's l3regex.
I've also consulted l3regex's documentation but I can't really figure this one out.

Here's what I have so far; the new command \redhighlight highlights words of a specific length in red:

\documentclass[a5paper,14pt]{article}
\usepackage[ngerman]{babel}
\usepackage{expl3, l3regex, xparse}
\usepackage{xcolor}

\ExplSyntaxOn
\tl_new:N \l_redhighlight_tl
\NewDocumentCommand \redhighlight { O{1} m } {
    \tl_set:Nn \l_redhighlight_tl { #2 }    
    \regex_replace_all:nnN {
        % note: I am using \"?\w to match German umlauts
        (\"?\b\w)((?:\"?\w){#1})\b
    } {
        \cB\{\c{color}\cB\{red\cE\}\1\2\cE\}
    } \l_redhighlight_tl
    \tl_use:N \l_redhighlight_tl
}
\ExplSyntaxOff


\begin{document}

\redhighlight{Per default, all two-letter words 
                  are highlighted in red.}

\redhighlight[2]{By providing an optional integer 
                     value, one can state the length 
                     of words to be highlighted.}

\end{document}

Within the \regex_replace_all:nnN regular expression definition, i.e. (\"?\b\w)((?:\"?\w){#1})\b, rather than using the optional #1 parameter directly I'd like to do some calculation before using it in the quantifier expression.
I've tried the following:

\ExplSyntaxOn
\tl_new:N \l_redhighlight_tl
\int_new:N \l_optquant_int
\NewDocumentCommand \redhighlight { O{1} m } {
    \tl_set:Nn \l_redhighlight_tl { #2 }    
    \int_set:Nn \l_optquant_int { #1 - 1 }
    \regex_replace_all:nnN {
        % note: I am using \"?\w to match German umlauts
        (\"?\b\w)((?:\"?\w){\l_optquant_int})\b
    } {
        \cB\{\c{color}\cB\{red\cE\}\1\2\cE\}
    } \l_redhighlight_tl
    \tl_use:N \l_redhighlight_tl
}
\ExplSyntaxOff

But this does not seem to work. I'd appreciate any help.

Best Answer

You can only use literal numbers in the {n} part: the regular expression is not expanded, and in any case an integer variable does not by itself expand to a literal number.

You have to fully expand the numeric expression to a decimal number, but you also have to make sure not to expand too much; the best approach is to do the calculation before passing the argument:

\documentclass[a5paper]{article}
\usepackage[ngerman]{babel}
\usepackage{expl3, l3regex, xparse}
\usepackage{xcolor}

\ExplSyntaxOn
\tl_new:N \l_flor_redhighlight_tl
\NewDocumentCommand \redhighlight { O{1} m }
 {
  \flor_redhighlight:fn { \int_to_arabic:n { #1 - 1 } } { #2 }
 }

\cs_new_protected:Npn \flor_redhighlight:nn #1 #2
 {
  \tl_set:Nn \l_flor_redhighlight_tl { #2 }    
  \regex_replace_all:nnN
   {
    % note: I am using \"?\w to match German umlauts
    (\"?\b\w)((?:\"?\w){#1})\b
   }
   {
    \c{textcolor}\cB\{red\cE\}\cB\{\1\2\cE\}
   }
   \l_flor_redhighlight_tl

   \tl_use:N \l_flor_redhighlight_tl
}
\cs_generate_variant:Nn \flor_redhighlight:nn { f }
\ExplSyntaxOff


\begin{document}

\redhighlight{As a default, only one letter words
                  are highlighted in red.}

\redhighlight[2]{By providing an optional integer 
                     value, one can state the length 
                     of words to be highlighted.}

\redhighlight[3]{By providing an optional integer 
                     value, one can state the length 
                     of words to be highlighted. F"ur}

\end{document}


Note that it is preferred that \NewDocumentCommand passes control to an internal function if the code is not really simple. In this case it's even essential! You can appreciate the power of “generating variants”.

Functions and variables should have a common prefix to avoid conflicts as much as possible. Also, \textcolor{red}{stuff} is better than {\color{red}stuff}.

Some explanations

What happens with this code? The main internal function \flor_redhighlight:nn expects, as its first argument, an explicit number to be used in a quantifier. However, since the first group (\"?\b\w) already matches one letter, the quantifier should be one less than the stated number, so that passing [2] to \redhighlight really highlights two-letter words and not three-letter ones.

So the argument is passed in the form \int_to_arabic:n { #1 - 1 } to the variant \flor_redhighlight:fn, which essentially does

\flor_redhighlight:nn {<full expansion of #1>} { #2 }
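
As a rough trace of these steps for the call with [3] (the argument text is shortened here for illustration), the expansion would proceed along these lines:

% user-level call
\redhighlight[3]{F"ur}
% \NewDocumentCommand hands over to the f-type variant:
%   \flor_redhighlight:fn { \int_to_arabic:n { 3 - 1 } } { F"ur }
% the f-type argument is fully expanded before the base function runs:
%   \flor_redhighlight:nn { 2 } { F"ur }
% so the regular expression that l3regex sees contains a literal number:
%   (\"?\b\w)((?:\"?\w){2})\b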

One could have defined the variant with x instead of f and the result would have been the same. The difference is that x internally uses an \edef, while f works by pure expansion without resorting to \edef.
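
A minimal sketch of the x-type alternative, assuming the rest of the code above stays unchanged and that these lines replace the corresponding definitions (only the variant letter differs):

\ExplSyntaxOn
% generate \flor_redhighlight:xn instead of \flor_redhighlight:fn
\cs_generate_variant:Nn \flor_redhighlight:nn { x }
\NewDocumentCommand \redhighlight { O{1} m }
 {
  % the first argument is expanded via \edef; #2 is passed through untouched
  \flor_redhighlight:xn { \int_to_arabic:n { #1 - 1 } } { #2 }
 }
\ExplSyntaxOff

In both cases only the first argument is expanded, so the text to be processed in #2 is not affected.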

Related Solutions

[Tex/LaTex] Defining a find and replace algorithm using LaTeX3’s l3regex

The following produces the result you are looking for:

\documentclass{article}
\usepackage{expl3}
\ExplSyntaxOn
\tl_new:N \l_demo_tl
\cs_new_protected:Npn \demo #1 {
    \tl_set:Nn \l_demo_tl {#1}
    \regex_replace_all:nnN { \_(.*?)\_ } { \c{emph}\cB\{ \1 \cE\} } \l_demo_tl
    \tl_use:N \l_demo_tl
}
\ExplSyntaxOff
\begin{document}

\demo{This is a _test_ document.}

\end{document}

In the matching expression we have

\_ giving the underscore character
(...) providing a group of characters that are to be remembered and used as \1 in the replacement text
.*? which matches (lazily) zero or more occurrences of any character

In the replacement text

\c{...} provides a control sequence, so
\c{emph} is your emphasise command
\cB\{ produces the { character with the class of an opening token, for the beginning of the argument to your emphasise command
\cE\} ends the argument group.
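
The same building blocks can be recombined for other markup. As a sketch only, assuming a hypothetical command \demobold and variable \l_demobold_tl (neither is part of the original answer), text between asterisks could be wrapped in \textbf like this:

\ExplSyntaxOn
\tl_new:N \l_demobold_tl   % hypothetical name, for illustration only
\cs_new_protected:Npn \demobold #1 {
    \tl_set:Nn \l_demobold_tl {#1}
    % \* is a literal asterisk; (.*?) lazily captures the text in between
    \regex_replace_all:nnN { \*(.*?)\* } { \c{textbf}\cB\{ \1 \cE\} } \l_demobold_tl
    \tl_use:N \l_demobold_tl
}
\ExplSyntaxOff

\demobold{This is a *test* document.}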

In regular expressions and their replacements \ introduces a number of special constructions with non-standard meanings. The documentation gives some good examples.

As egreg points out, escaping _ is not strictly necessary, but the documentation recommends it, saying:

non-alphanumeric printable ascii characters can (and should) always be escaped

Note that, following egreg's kind remarks, the variable \l_demo was renamed in the code above to \l_demo_tl in compliance with LaTeX3 syntax conventions; the _tl indicates that it holds a token list. Also, the variable only needs to be declared once, so the declaration was moved out of the control sequence.

Regular Expression using `expl3`

The last line should replace -ā by ā, -i by i, -u by u and so on.

To break it up:

\- represents the character -.
Next, everything wrapped in parentheses (...) is the part of the string that should be used for the replacement.
Then follows a group wrapped in square brackets [...], which essentially means "one of these".
After that comes a repetition marker {1}, meaning that the preceding character or group is matched exactly once.
As for the replacement, \1 selects the first captured part of the string, that is, the first part wrapped in parentheses (...), which is [āiueoīū]{1} in this case.

So, it means replace a - followed by one of āiueoīū, but only exactly one character of these, by this very character. Essentially, it removes the -.
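
The code this explanation refers to is not reproduced here. A minimal sketch reconstructed from the description, assuming a Unicode engine (XeLaTeX or LuaLaTeX) and a hypothetical variable \l_translit_tl with made-up sample text, might look like this:

\ExplSyntaxOn
\tl_new:N \l_translit_tl   % hypothetical name, for illustration only
\tl_set:Nn \l_translit_tl { k-āma ~ -iti }   % ~ is a space in expl3 syntax
% \- is a literal hyphen; ([āiueoīū]{1}) captures exactly one listed vowel
\regex_replace_all:nnN { \-([āiueoīū]{1}) } { \1 } \l_translit_tl
% each hyphen directly followed by a listed vowel has now been dropped
\ExplSyntaxOff

Since a character class already matches exactly one character, the {1} is redundant, and \-([āiueoīū]) should behave the same way.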

See the documentation of the l3regex package which is currently included in the doc of the LaTeX3 interfaces (chapter 8).
