[TeX/LaTeX] Defining a find and replace algorithm using LaTeX3’s l3regex

l3regex latex3

I've been trying to work out the mechanics of LaTeX3's regular expression system as implemented in l3regex, but I am having some difficulty understanding why it behaves the way it does.

Take the following code as an example:

\documentclass{article}
\usepackage{expl3}
\ExplSyntaxOn
\cs_new:Npn \demo #1 {
    \tl_new:N \l_demo
    \tl_set:Nn \l_demo {#1}
    \regex_replace_all:nnN {\_*?\_} {\emph{\1}} \l_demo
    \tl_use:N \l_demo
}
\ExplSyntaxOff
\begin{document}

\demo{This is a _test_ document.}

\end{document}

The following text will be printed on the page:

This is a œmph– ̋testœmph– ̋ document.

But I would have expected to see the following:

This is a test document. (with the word "test" emphasised, as \emph{test} would print it)

Other regular expressions along the same lines give similar results.

Would anyone be able to explain what is happening in this example, and how problems such as this might be fixed?

Best Answer

The following produces the result you are looking for:

\documentclass{article}
\usepackage{expl3}
\ExplSyntaxOn
\tl_new:N \l_demo_tl
\cs_new:Npn \demo #1 {
    \tl_set:Nn \l_demo_tl {#1}
    \regex_replace_all:nnN { \_(.*?)\_ } { \c{emph}\cB\{ \1 \cE\} } \l_demo_tl
    \tl_use:N \l_demo_tl
}
\ExplSyntaxOff
\begin{document}

\demo{This is a _test_ document.}

\end{document}

In the matching expression we have

  • \_ giving the underscore character
  • (...) providing a group of characters that are to be remembered and used as \1 in the replacement text
  • .*? which matches (lazily) zero or more occurrences of any character
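
To see what the lazy quantifier changes, here is a minimal sketch that counts matches with \regex_count:nnN, once with the lazy pattern from the code above and once with a greedy variant. The names \democount, \l_demo_lazy_int and \l_demo_greedy_int are made up for this illustration and are not part of the original code.

\documentclass{article}
\usepackage{expl3}
\ExplSyntaxOn
% Hypothetical helper variables, declared once at top level
\int_new:N \l_demo_lazy_int
\int_new:N \l_demo_greedy_int
\cs_new_protected:Npn \democount #1
  {
    % Lazy: each match stops at the nearest closing underscore
    \regex_count:nnN { \_(.*?)\_ } {#1} \l_demo_lazy_int
    % Greedy: a match runs on to the last underscore in the input
    \regex_count:nnN { \_(.*)\_ } {#1} \l_demo_greedy_int
    lazy:~ \int_use:N \l_demo_lazy_int ,~
    greedy:~ \int_use:N \l_demo_greedy_int
  }
\ExplSyntaxOff
\begin{document}

\democount{a _b_ and _c_}

\end{document}

With two marked-up words in the input, the lazy pattern finds two matches (_b_ and _c_), whereas the greedy pattern finds a single match running from the first underscore to the last. That is why the lazy form is the one to use when a line may contain several emphasised words.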

In the replacement text

  • \c{...} provides a control sequence, so
  • \c{emph} is your emphasise command
  • \cB\{ produces the { character with the class of an opening token, for the beginning of the argument to your emphasise command
  • \cE\} ends the argument group.
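
The same replacement syntax carries over to other commands. As a sketch only, the following adds a second rule that turns *text* into \textbf{text}. The names \demomarkup and \l_demo_markup_tl are invented for this example, and the choice of asterisks for bold is an assumption, not something from the question.

\documentclass{article}
\usepackage{expl3}
\ExplSyntaxOn
% Hypothetical variable for this illustration, declared once
\tl_new:N \l_demo_markup_tl
\cs_new_protected:Npn \demomarkup #1
  {
    \tl_set:Nn \l_demo_markup_tl {#1}
    % _text_ becomes \emph{text}
    \regex_replace_all:nnN { \_(.*?)\_ }
      { \c{emph}\cB\{ \1 \cE\} } \l_demo_markup_tl
    % *text* becomes \textbf{text}; an unescaped * is a quantifier,
    % so the literal asterisk has to be written as \*
    \regex_replace_all:nnN { \*(.*?)\* }
      { \c{textbf}\cB\{ \1 \cE\} } \l_demo_markup_tl
    \tl_use:N \l_demo_markup_tl
  }
\ExplSyntaxOff
\begin{document}

\demomarkup{This is *bold* and _emphasised_ text.}

\end{document}

Each rule is a separate pass over the same token list, so only the pattern and the control sequence name inside \c{...} change from one rule to the next.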

In regular expressions and their replacements \ introduces a number of special constructions with non-standard meanings. The documentation gives some good examples.

As egreg points out, escaping _ is not strictly necessary, but the documentation recommends it, saying:

non-alphanumeric printable ascii characters can (and should) always be escaped

Following egreg's kind remarks, the variable \l_demo has been renamed to \l_demo_tl in the code above, in line with LaTeX3 naming conventions: the _tl suffix indicates that the variable holds a token list. The variable also only needs to be declared once, so the declaration has been moved out of the function definition.
