Build unambiguous binary expression with property

combinatoricsregular expressions

Is there a strategy for building unambiguous binary expressions that fit some property like "no consecutive 1's", "can't start with 0", "Blocks of 1 can only be divisible by 5", etc.?

I can't seem to figure out a list of steps I want to follow after practicing.
Is there some sort of strategy I can follow for each problem?
For example I just start off by writing down the block decomposition unambiguous string for all binary strings $\{0\}^*\{\{1\}\{1\}^* \{0\}\{0\}^*\}^*\{1\}^*$ and then start tearing it apart/adding stuff randomly to make my property work, which I feel is not right.

For example the blocks of 1 need to be divisible by 5 example I mentioned, I just go "Ok let's just add $\{11111\}\{1\}^*$ in the middle instead of $\{1\}\{1\}^*$… and it should work??

But I don't even know if that's right.

Can I get some guidance on how to start these questions?

Thanks!

edit: more specifically, it would be great to get sort of a checklist of things to check to make sure the expression is valid. For example, I've kind of picked up that I should check that I'm not restricting my string to starting with something specific, by adding a {0} at the start, forcing it to start with 0.

Best Answer

What you are asking about is regular expressions. A regular expression is just a way to compactly describe a set of strings formed using characters from an alphabet $A$ (which we will refer to as a language) using three operations:

  1. Concatenation: We can indicate the concatenation of characters $a, b$ by $ab$. $ab$ simply denotes the language $\{ab\}$, consisting of one two element string $ab$.

  2. Or: Our second operation gives us the ability to specify alternatives that may occur in a pattern. We write $a | b$ to denote the language $\{a,b\}$, or $a|b|c|d|e$ for $\{a,b,c,d,e\}$. Concatenation generally is given precedence over or, thus as an example the expression $abc | de$ denotes the language $\{abc, de\}$.

  3. Closure: Our final operation is closure. Given a language $L$, we write $L^*$ to denote a new language that consists of all possible concatenations of strings in $L$. Thus as an example, if $L = \{a\}$, then $L^* = \{ \ \ ,a, aa, aaa, \dots\}$. If $L = \{a,b\}$, then $L^* = \{\ \ , a, b, aa, ab, ba, bb, aaa, aab, \dots\}$. Notice that the first element I've listed in $L^*$ is an empty string. This means that we can choose to include no strings from the language $L$ that we have chosen.

Lastly, we can use parentheses to order regular expressions. Thus to address your question, "how do we build unambiguous binary strings that fit a particular pattern?", let's do an example:

Strings with no consecutive 1's

One way to describe this language is to find some building blocks with which to build the language. Here are two types of strings that appeal as good building blocks for this language: those that start with a 1 and are followed only by zeros, and those that start with 0's and end with a single 1.

More compactly, we have languages $L_0 = 0\{0\}^*1$, and $L_1 = 10\{0\}^* $. Using these as building blocks, note that $L_0^*$ describes all strings without 1 repeating that start with a 0 and end with a 1. Likewise $L_1^*$ describes all strings without 1 repeating that start with a 1 and end with a 0. Thus all we are missing is all such strings starting and ending with a 0, and similarly a 1. Concatenating a 0 to $L_0^*$ generates all missing strings starting and ending with a 0, except for those that contain no 1's. Similarly concatenating a 1 to $L^*_1$ generates all missing strings starting and ending with a 1, with the exception of the singular string 1. All such strings of this type fall into one of the above categories.

Hence we may describe all strings without 1's by: $1 \ | \ \{0\}^* \ | \ L^*_{0} \ | \ L^*_0 0 \ | \ L^*_1 \ | \ L^*_1 1$


As you can see, regular expressions can often be tricky. I don't think there is a "one size fits all" algorithm that you can apply to learn how to describe all regex with ease, but what can help is to modularize the problem as I have done above. Instead of describing the entire language at once, break it into building blocks, and try to reconstruct it from those building blocks. Keep track of what you cannot form so far with the blocks you select, as you may not have enough to describe the entire language.

When you realize your regex cannot construct some family of strings you desire, just create a new regex to describe that family and use the Or operation. Without using all of the operations available to you, generating arbitrary languages becomes quite difficult indeed.


To check that an expression you have made is valid, it suffices to check two things:

  1. For every element of the language, the regex is able to describe it. For example, in the no consecutive 1's case, I showed that for strings starting and ending with either 0,0; 0,1; 1,0; or 1,1 that contained no consecutive 1's that the regex was able to generate these strings.

  2. The regex does not generate any element not in the language. You've got to ensure that no unintended elements are generated. For example, in the no consecutive 1's problem, if we were to have accidentally have done $\{1\{0\}^*\}^*$, then this could allow for consecutive 1's (when we select no zeros for the first string in the concatenation). Recognizing pitfalls like these allows you to ensure that you are not including extraneous strings.

Related Question