[Tex/LaTex] Are LaTeX3 regular expressions able to match characters at the start of a line

expl3l3regex


I would like to use LaTeX3 regular expressions to extract the "lines" from text like the following:

- First line
- second line ending with \LaTeX
- third line with - in the middle
- fourth line that
goes on for a bit

By a "line" I mean text that starts with a dash - that occurs at the beginning of a "physical line", and then continues until the next such dash, or the end of the string.

My first guess was that something like the following would work:

\regex_split:nVN {  [\f\n\r]\s+-\s+  } { \Lines } \l_tmp_seq

My understanding is that this should match a "newline character" followed by - that is surrounded by white-space on both sides. Unfortunately, this fails horribly, producing only one match:

enter image description here

As this didn't work I thought that I would try a very simple-minded approach and just split the string on white-space surrounded dashes -. This would give the wrong result on lines like the third line above, but I can use -- in the middle of the line so this is a workable alternative. As \s matches white space, the following should split lines on -:

\regex_split:nVN { \s+-\s+  } { \Lines } \l_tmp_seq

This is better, but the results are still not perfect because, in addition to the expected problem on line 3, having \LateX at the end of a line confuses \regex_split presumably because macros tends to swallow spaces:

enter image description here

The best solution that I have found is using:

\regex_split:nVN { \B-\B } { \Lines } \l_tmp_seq

This produces the output that I was expecting from the last regex:

enter image description here

That is, we are getting an unwanted match for the - in the middle of the third line but this is otherwise OK.

Question Is there a regular expression that does not return any false matches? Initial and/or final matching empty sequences are allowed.

[Aside: I am not worried about the "empty" initial sequences above because I know that can get rid of them using

\seq_pop_left:NN \l_tmp_seq \l_tmpa_tl % pop initial blank

Terminating empty sequences can, of course, be dealt with in almost exactly the same way using \seq_pop_right:NN.]

Here is a full MWE to produce the examples above:

\documentclass{article}
\usepackage{xparse}
\parindent=0pt

\newcommand\BS{$\backslash$}
\newcommand\Test[1]{\textbf{Testing:}\quad\texttt{#1}}
\begin{document}

\newcommand\Lines{
    - First line
    - second line ending with \LaTeX
    - third line with -- in the middle
    - fourth line that
    goes on for a bit
}

\ExplSyntaxOn
\seq_new:N \l_tmp_seq
\cs_generate_variant:Nn \regex_split:nnN { nVN }

\Test{[\BS f\BS n\BS r]\BS s+-\BS s+}
\regex_split:nVN { [\f\n\r]\s+-\s+  } { \Lines } \l_tmp_seq
\begin{itemize}\item\seq_use:Nn \l_tmp_seq {\item }\end{itemize}

\Test{\BS s-\BS s}
\regex_split:nVN { \s-\s  } { \Lines } \l_tmp_seq
\begin{itemize}\item\seq_use:Nn \l_tmp_seq {\item }\end{itemize}

\Test{\BS B-\BS B}
\regex_split:nVN { \B-\B  } { \Lines } \l_tmp_seq
\begin{itemize}\item\seq_use:Nn \l_tmp_seq {\item }\end{itemize}

\ExplSyntaxOff

\end{document}

Escaping the - in the regular expressions doesn't help either.

Finally, the problem may be that the newline characters in \Lines are stripped when it is defined. As far as I can see this is OK because the code above gives similar output. In particular, if I use:

\documentclass{article}
\usepackage{xparse}
\parindent=0pt

\begingroup\obeylines
\gdef\Lines{
    - First line
    - second line ending with \LaTeX
    - third line with -- in the middle
    - third line that
    goes on for a bit
}
\endgroup

\begin{document}

  \ExplSyntaxOn
  \seq_new:N \l_tmp_seq
  \cs_generate_variant:Nn \regex_split:nnN { nVN }
  \regex_split:nVN { [\f\n\r]\s+\-\s+  } { \Lines } \l_tmp_seq

  Sequence~length:~ \seq_count:N \l_tmp_seq.
  \begin{itemize}\item\seq_use:Nn \l_tmp_seq {\item }\end{itemize}
  \ExplSyntaxOff

\end{document}

then I get the output:

enter image description here

In practice, I am getting \Lines as the \BODY of a environ environment, so I don't really have the luxury of this definition…

Best Answer

You need obeylines or something equivalent so that the newline char doesn't disappear. The following splits the lines (nearly) as wanted. Instead of \^^M one can also use \r (I only used the first because \show\Lines showed ^^M).

This creates this sequence:

The sequence \l_tmp_seq contains the items (without outer braces):
>  {First line}
>  {second line ending with \LaTeX }
>  {third line with -- in the middle}
>  {fourth line that^^Mgoes on for a bit^^M}.
<recently read> }

One problem could be that the fourth item contains a ^^M and so a probably unwanted line break.

\documentclass{article}
\usepackage{xparse}
\parindent=0pt

\newcommand\BS{$\backslash$}
\newcommand\Test[1]{\textbf{Testing:}\quad\texttt{#1}}
\begin{document}

\begingroup
\obeylines
\gdef\Lines{
    - First line
    - second line ending with \LaTeX
    - third line with -- in the middle
    - fourth line that
    goes on for a bit
}

\endgroup

\ExplSyntaxOn
\seq_new:N \l_tmp_seq
\cs_generate_variant:Nn \regex_split:nnN { nVN }

\Test{}\verb+^^M\-+
\regex_replace_once:nnN { \A\^^M?\-\s* } {} \Lines %to get rid of the first ^^M
\regex_split:nVN { \^^M\-\s* } \Lines \l_tmp_seq

\begin{itemize}\item\seq_use:Nn \l_tmp_seq {\item }\end{itemize}


\ExplSyntaxOff

\end{document}

enter image description here

Related Question