[TeX/LaTeX] Parse a string into tokens of numbers and not numbers

parsing

I have a string that I want to parse into numbers and non-numbers.

For my purposes:

A Number is EITHER a sequential string of digits OR a sequential string of digits followed by a . and another sequential string of digits.

A Non-Number is anything that is not a Number.

For example

ljksadflh23898129hfafh0324.22234

should be parsed into something like

ljksadflh, 23898129, hfafh, 0324.22234

or

ljksadflh/23898129/hfafh/0324.22234

or whatever floats your boat as long as the list retains the same ordering.

Best Answer

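With the experimental (but pretty much ready for release) package l3regex (found in the l3experimental bundle on CTAN), this task is a piece of cake. In current expl3 releases the regex module is part of the kernel, so the \regex_... functions used below are available there as well.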

\documentclass{article}
\usepackage{l3regex,xparse}
\ExplSyntaxOn
\seq_new:N \l_uiy_result_seq
\NewDocumentCommand { \UiySplit } { m }
  {
    \regex_extract_all:nnN { \D+ | \d+(?:\.\d*)? } {#1} \l_uiy_result_seq
    \seq_map_inline:Nn \l_uiy_result_seq { item:~##1\par }
  }
\ExplSyntaxOff
\begin{document}
  \UiySplit{ljksadflh23898129hfafh0324.22234}
\end{document}

The \regex_extract_all:nnN line splits the user input #1 into pieces, each of which is either one or more (+) non-digits (\D), or (|) one or more digits (\d) optionally followed by a dot (\., escaped because an unescaped dot has a special meaning) and zero or more digits (\d*). The optional part is a group (...) made optional by the trailing ?; writing it as (?:...) makes the group "non-capturing", so it does not add extra items to the result. The next line maps through all the matches we found, with ##1 being a single match. Of course, you can do whatever you want with the items of the sequence \l_uiy_result_seq.
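To get the comma-separated form asked for in the question, the mapping can be swapped for \seq_use:Nn. The following is only a minimal sketch: \UiySplitList and \l_uiy_pieces_seq are hypothetical names introduced for illustration, and the preamble assumes a LaTeX installation in which the regex module is part of the expl3 kernel (on older ones, load l3regex as in the code above).

```latex
\documentclass{article}
\usepackage{xparse} % harmless on recent LaTeX, needed for \NewDocumentCommand on older ones
\ExplSyntaxOn
% Hypothetical names, not part of the original answer
\seq_new:N \l_uiy_pieces_seq
\NewDocumentCommand { \UiySplitList } { m }
  {
    % Same regular expression as above: a run of non-digits, or a number
    \regex_extract_all:nnN { \D+ | \d+(?:\.\d*)? } {#1} \l_uiy_pieces_seq
    % Typeset the pieces in their original order, separated by ", "
    \seq_use:Nn \l_uiy_pieces_seq { ,~ }
  }
\ExplSyntaxOff
\begin{document}
\UiySplitList{ljksadflh23898129hfafh0324.22234}
% typesets: ljksadflh, 23898129, hfafh, 0324.22234
\end{document}
```

If a trailing dot, as in 12., should not count as part of the number, \d* in the optional group could be changed to \d+.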

Edit: The module also provides regular expression replacements. If I remember the syntax correctly, the following should work.

\ExplSyntaxOn
\tl_new:N \l_uiy_result_tl
\NewDocumentCommand { \UiySplit } { m }
  {
    \tl_set:Nn \l_uiy_result_tl {#1}
    \regex_replace_all:nnN
        { (\D+) (\d+(?:\.\d*)?) }
        { \c{uiy_do:nn} \cB{\1\cE} \cB{\2\cE} }
        \l_uiy_result_tl
    \tl_use:N \l_uiy_result_tl
  }
\cs_new_protected:Npn \uiy_do:nn #1#2 { \use:c {#1} {#2} }
\ExplSyntaxOff

This time, I catch both the sequence of non-digits and the number as captured groups, \1 and \2. Each such occurrence is replaced by the macro \uiy_do:nn (the \c escape in this case means "build a command"), then a begin-group (\cB) character { (this time, \c indicates the category code), then the non-digits (\1), then an end-group (\cE) character }, then another \cB{, the number (\2), and a closing \cE}.

After that, the token list consists of calls of the form \uiy_do:nn {<non-digits>} {<number>}, for instance \uiy_do:nn {hfafh} {0324.22234}. We then simply use its contents with \tl_use:N. The final step is to actually define \uiy_do:nn. Here, I defined it as simply building a command from #1 and giving it the argument #2. This very simple action could be done at the replacement step using \c{\1} for "build a command from the contents of group \1", and technically that would be slightly better, since it produces an "undefined control sequence" error if the relevant command is not defined. Another way to get that error detection is to replace \use:c {#1} {#2} by \cs_if_exist_use:cF {#1} { \msg_error:nnx { uiy } { undefined-command } {#1} } {#2}, with an appropriately defined error message.
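The \c{\1} variant can be sketched as a complete document. This is an illustration under assumptions rather than a tested recipe: \UiyApply and \l_uiy_apply_tl are hypothetical names, the input is assumed to alternate between the names of existing one-argument commands (here \textbf and \textit) and numbers, and the braces in the replacement are written in the escaped form \cB\{ ... \cE\}.

```latex
\documentclass{article}
\usepackage{xparse} % harmless on recent LaTeX, needed for \NewDocumentCommand on older ones
\ExplSyntaxOn
% Hypothetical names, not part of the original answer
\tl_new:N \l_uiy_apply_tl
\NewDocumentCommand { \UiyApply } { m }
  {
    \tl_set:Nn \l_uiy_apply_tl {#1}
    % \1 = run of non-digits, used as a command name via \c{\1}
    % \2 = the number, wrapped in a begin-group/end-group pair
    \regex_replace_all:nnN
      { (\D+) (\d+(?:\.\d*)?) }
      { \c{\1} \cB\{ \2 \cE\} }
      \l_uiy_apply_tl
    \tl_use:N \l_uiy_apply_tl
  }
\ExplSyntaxOff
\begin{document}
\UiyApply{textbf12textit34.5}
\end{document}
```

With this input, the token list after replacement should be \textbf{12}\textit{34.5}. A name that does not correspond to a defined command would then raise the usual "undefined control sequence" error when the token list is used.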
