QGIS – Capturing All Matches with Regular Expressions in QGIS

expression, qgis, qgis-expression, regular expression

Problem

I use the QGIS expression function regexp_matches(), which returns an array of all strings captured by the capturing groups. By default, a regular expression captures only the first match (occurrence) in a string. For example, the following expression:

regexp_matches('These are my QGIS strings!','(s.)')

returns only [ 'se' ], as expected: the first occurrence of s (position 4 of the input string), followed by the next character.

Question

How can I rewrite the expression to capture all occurrences of s, each followed by the next character? The result I want would look like: [ 'se', 'st', 's!' ]

What I tried

Using the regexp search pattern '(s.).*(s.).*(s.)' would be a workaround. It's not very elegant, as you have to know in advance how often s occurs. There seems to be a /g modifier in regular expressions for global matching; however, I was not able to find out the correct syntax.
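The limitation of the fixed-count pattern can be checked outside QGIS. The following is only a sketch in plain Python, using the standard re module as a stand-in for the QGIS regex engine (an assumption: the two are expected to agree for a pattern this simple). The helper name fixed_count_matches is hypothetical and not part of QGIS.

```python
import re

# Pattern from the question: it hard-codes exactly three occurrences of "s.".
FIXED_PATTERN = r'(s.).*(s.).*(s.)'


def fixed_count_matches(text):
    """Return the three captured groups, or an empty list if the pattern fails."""
    match = re.search(FIXED_PATTERN, text)
    return list(match.groups()) if match else []


# Exactly three lowercase "s" in the input: all of them are captured.
print(fixed_count_matches('These are my QGIS strings!'))  # ['se', 'st', 's!']
# Only two occurrences: the whole pattern fails and nothing is captured.
print(fixed_count_matches('Just some text'))  # []
# Four occurrences: the greedy .* skips one of them ('sb').
print(fixed_count_matches('sa sb sc sd'))  # ['sa', 'sc', 'sd']
```

So the pattern only gives the full answer when the number of occurrences happens to equal the number of groups written into it.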

Best Answer

I believe the global matching flag is not (yet) supported, but there is a workaround.

regexp_replace comes to the rescue, as it replaces every occurrence of the pattern. The key trick is that the pattern must cover the whole string, so that every character lands in one of the capture groups. The replacement then drops everything that does not correspond to the initial pattern and adds a separator after each hit.

regexp_replace('These are my QGIS strings!',
               '([^s])*(s.)([^s])*',
               '\\2;')

==> 'se;st;s!;'

Let's break it down. Anything between parentheses is a capture group, and the groups are numbered starting at 1.

([^s])*: Anything that is not an s, as many as possible (*)
(s.): Followed by the desired target, an s and its next character
([^s])*: Optionally followed by anything that is not an s, as many as possible
\\2;: Replaced by the 2nd capture group (the s.) followed by ;. This is repeated for each occurrence of the 2nd capture group :-)
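The breakdown above can be reproduced step by step outside QGIS. This is a minimal sketch in plain Python with the standard re module, assuming that re.sub behaves like QGIS's regexp_replace for this pattern; the variable names are illustrative only.

```python
import re

text = 'These are my QGIS strings!'

# Same pattern as in the QGIS expression:
#   ([^s])*  any run of characters other than "s"          (group 1)
#   (s.)     the target: an "s" plus the next character    (group 2)
#   ([^s])*  any further run of characters other than "s"  (group 3)
pattern = r'([^s])*(s.)([^s])*'

# Every match is replaced by group 2 followed by the ";" separator.
joined = re.sub(pattern, r'\2;', text)
print(joined)  # se;st;s!;
```

Note that the match is case-sensitive, so the capital S in QGIS is treated as "not an s" and is swallowed by the third group.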

The next step is to break this string into an array:

string_to_array(
   regexp_replace('These are my QGIS strings!',
                  '([^s])*(s.)([^s])*',
                  '\\2;'),
delimiter:=';')

==> ['se','st','s!','']

Finally, since the trailing ; creates an empty element at the end of the array, we need to remove that last element:

array_remove_at(
   string_to_array(
     regexp_replace('These are my QGIS strings!',
                    '([^s])*(s.)([^s])*',
                    '\\2;'),
   delimiter:=';'),
-1)

==> ['se','st','s!']
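The three steps (replace, split, drop the last element) can be tested together outside QGIS. The sketch below uses plain Python and the standard re module as a stand-in for the QGIS functions; the function name all_s_pairs is hypothetical, and the correspondence to regexp_replace, string_to_array and array_remove_at is an assumption for illustration rather than a statement about how QGIS implements them.

```python
import re

SEPARATOR = ';'


def all_s_pairs(text):
    """Emulate regexp_replace -> string_to_array -> array_remove_at(-1)."""
    # Step 1: keep only the "s." hits, each followed by the separator.
    joined = re.sub(r'([^s])*(s.)([^s])*', r'\2' + SEPARATOR, text)
    # Step 2: split on the separator; the trailing separator yields a final ''.
    parts = joined.split(SEPARATOR)
    # Step 3: drop the last element.
    return parts[:-1]


sample = 'These are my QGIS strings!'
print(all_s_pairs(sample))  # ['se', 'st', 's!']

# Python's own global search returns the same list for this input.
print(re.findall(r's.', sample))  # ['se', 'st', 's!']
```

One caveat that applies to the approach in general: the separator must not occur inside a captured match (for example an s directly followed by ;), otherwise the split step cuts that match in two.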