[TeX/LaTeX] Detokenizing without extra spaces


I need to pass potentially arbitrary characters through untouched and found this macro:

\def\test#1{\expandafter\zap@space\detokenize{#1} \@empty}

The problem is that \detokenize inserts a space after each command name in its expansion, so I get rid of these with LaTeX's \zap@space. Unfortunately, \zap@space removes every space, and I need to keep any spaces which \detokenize did not produce. I suspect that there is some cunning way to do this by redefining the catcode of the space character, but it is somewhat beyond me.
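To make the extra spaces visible, here is a minimal sketch, assuming an e-TeX-enabled LaTeX engine such as pdflatex. The square brackets are only there to show where the output starts and ends.

```latex
\documentclass{article}
\begin{document}
% \detokenize appends a space after every control word (\A, \d, \z)
\typeout{[\detokenize{\A\d{2,}.+\z}]}
% terminal/log shows: [\A \d {2,}.+\z ]
\end{document}
```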

For example (yes, it's regular expressions I need to pass through …),

\test{\A\d{2,}.+\z}

should expand to \A\d{2,}.+\z and


\test{A test}

should expand to A test and not lose the space that was there originally.

I should say that \verb doesn't work, as these regexp strings are the values of keyval pairs:


Update: It seems too hard to do this in general without making other things rather messy, so I've settled for requiring that any regexps used are canonicalised to contain no literal spaces. It's always possible to replace them with '\x20', for example, so this means that \zap@space will only be zapping spaces created by \detokenize, which is fine. Many thanks for all of the answers though, as they are very instructive.
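As a small illustration of that workaround, the sketch below reuses the macro from the top of the question unchanged. The test strings are made up for the example, and the brackets only delimit the output.

```latex
\documentclass{article}
\makeatletter
% the macro from the question, unchanged
\def\test#1{\expandafter\zap@space\detokenize{#1} \@empty}
\makeatother
\begin{document}
% space written as \x20: only the space added by \detokenize is zapped
\typeout{[\test{A\x20test}]}   % expected: [A\x20test]
% a literal space, by contrast, is zapped along with the others
\typeout{[\test{A test}]}      % expected: [Atest]
\end{document}
```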

Best Answer

It is impossible to distinguish \A . from \A. once TeX has converted those into tokens: the only solution if you need to preserve those spaces is to read the argument verbatim.
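A quick way to convince yourself of this is to compare two macros with \ifx. This is only a sketch; the names \first and \second are made up for the illustration.

```latex
\documentclass{article}
\def\first{\A .}%   space typed after the control word \A
\def\second{\A.}%   no space typed
\begin{document}
% TeX drops the space after \A while reading the file,
% so both macros hold exactly the tokens \A and "."
\ifx\first\second \typeout{identical}\else \typeout{different}\fi
\end{document}
```

This writes "identical" to the terminal: by the time \test sees its argument, the information about the space is already gone.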

However, if you are fine with that, then the simplest method is to update the l3kernel and l3experimental bundles (and l3packages) to a very recent version (February 2012), then use tools from the l3regex package to add \string in front of each token in the argument, and expand. The code below does that (replace \tl_show:N by whatever you want to do to the string).

\usepackage{l3regex} % in the preamble
\ExplSyntaxOn
\cs_new_protected:Npn \test #1
  {
    \tl_set:Nn \l_tmpa_tl {#1}
    \regex_replace_all:nnN { . } { \c{string} \0 } \l_tmpa_tl
    \tl_set:Nx \l_tmpb_tl { \l_tmpa_tl }
    % now \l_tmpb_tl contains what you want:
    \tl_show:N \l_tmpb_tl
  }
\ExplSyntaxOff
% in the document body:
\test{\A\d{2,}.+ Hello, world!\z}

How does it work? \regex_replace_all:nnN performs a replacement on a stored token list, so we need to store the argument.

\tl_set:Nn    % Set locally
  \l_tmpa_tl  % the "local temporary token list" `\l_tmpa_tl`
  {#1}        % to contain "#1" (the argument).
\regex_replace_all:nnN % Replace every occurrence of
  { . }                % any token, even braces etc.
  {                    % by
    \c{string}         %   \string
    \0                 %   what was matched (the token)
  } \l_tmpa_tl         % in \l_tmpa_tl
\tl_set:Nx        % Set locally, with expansion,
  \l_tmpb_tl      % the "local temporary token list b"
  { \l_tmpa_tl }  % to (the expansion of) `\l_tmpa_tl`
\tl_show:N    % Show the contents of
  \l_tmpb_tl  % the token list variable `\l_tmpb_tl` 

Of course, under the hood, l3regex does a lot of work, so whether this is fast enough will depend on how many such regular expressions you have to go through.

EDIT: A plain TeX solution for the very specific task you are asking for. I am assuming that the strings never contain the character ^^A (char code 1). The idea is to use \lowercase to change all true space tokens to some recognizable character. Then \detokenize, and loop through the result one character at a time (this automatically skips spaces), replacing ^^A by a space.

      % Ensure that every character is preserved by \lowercase.
      % Except spaces, changed to ^^A
% Then map {^^A => space, space =>} onto the string.
\test{ab c\d e{f} \fg }\show\result
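The steps described above can be sketched roughly as follows. This is a reconstruction under stated assumptions, not necessarily the exact code: it targets plain TeX on an e-TeX-enabled engine (e.g. pdftex) with 8-bit character codes, and the helper names \test@map and \test@end are made up for the sketch.

```tex
% Plain TeX sketch; \detokenize needs an e-TeX-enabled engine.
\begingroup
\catcode`\@=11    % allow @ in the (hypothetical) helper names
\catcode`\^^A=12  % same catcode as ^^A will have in \detokenize output
\gdef\test#1{%
  \begingroup
    % Ensure that every character is preserved by \lowercase
    % (\lccode 0 means "leave unchanged"; only codes 0..255 are covered).
    \count@=0
    \loop
      \lccode\count@=0
      \advance\count@ 1
    \ifnum\count@<256 \repeat
    % Except spaces, changed to ^^A (char code 1).
    \lccode`\ =1
  \lowercase{\endgroup
    \edef\result{\detokenize{#1}}}%
  % Then map {^^A => space, space => nothing} onto the string.
  \edef\result{\expandafter\test@map\result\test@end}%
}
% Hypothetical helper: grabs one character at a time; undelimited
% arguments skip the spaces that \detokenize inserted.
\gdef\test@map#1{%
  \ifx\test@end#1\else
    \ifx^^A#1\space\else#1\fi
    \expandafter\test@map
  \fi}
\endgroup

\test{ab c\d e{f} \fg }
\show\result
\bye
```

With this sketch, \show\result should display ab c\de{f} \fg: the space typed between b and c and the one before \fg survive, while the spaces that \detokenize adds after \d and \fg are dropped.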