Are regular expressions sets

definitionnotationregular expressionsregular-language

I have a confusion concerning regular expressions and languages generated by them. In our course material, regular expressions are defined as follows.

If $\Sigma$ is an alphabet, regular expressions based on it are defined as follows.

  1. $\epsilon$ or the empty string is a regular expression

  2. $\varnothing$ is a regular expression

  3. for every $\alpha \in\Sigma$, $\alpha$ is a regular expression

  4. if $R$ and $S$ are regular expressions, the so is $R + S$, where $+$ denotes union

  5. if $R$ and $S$ are regular expressions, then so is $RS$, the concatenation of the expressions

  6. If $R$ is a regular expression, then so is $(R^*)$, where $(\cdot)^*$ is the Kleene closure.

On the other hand, sets of regular expressions are defined very similarly:

If $\Sigma$ is an alphabet, sets of regular expressions $\mathrm{RE}(\Sigma)$ based on it are defined as follows.

  1. the expression $\epsilon\in\mathrm{RE}(\Sigma)$

  2. $\varnothing\in\mathrm{RE}(\Sigma)$

  3. for every $\alpha \in\Sigma$, $\alpha\in\mathrm{RE}(\Sigma)$

  4. if $R, S \in\mathrm{RE}(\Sigma)$, then so is $R + S\in\mathrm{RE}(\Sigma)$

  5. if $R,S\in\mathrm{RE}(\Sigma)$ are regular expressions, then $RS\in\mathrm{RE}(\Sigma)$

  6. If $R$ is a regular expression, then so is $(R^*)$.

Finally, languages generated by regular expressions also have a very similar definition.

If $\Sigma$ is an alphabet and $R \in\mathrm{RE}(\Sigma)$, the language generated by $R$, denoted $\mathcal L (R)$ has the following definition.

  1. $\mathcal L(\epsilon) = \{\epsilon\}$

  2. $\mathcal L(\varnothing)= \varnothing$

  3. for every $\alpha\in\mathrm{RE}(\Sigma)$, $\mathcal L(\alpha) = \{\alpha\}$

  4. if $R, S \in\mathrm{RE}(\Sigma)$, then so is $\mathcal L(R + S) = \mathcal L(R)\cup\mathcal L(S)$

  5. if $R,S\in\mathrm{RE}(\Sigma)$, then $\mathcal L(RS) = \mathcal L(R)\mathcal L(S)$

  6. If $R\in\mathrm{RE}(\Sigma)$, then $\mathcal L(R^*) = \mathcal L(R)^*$.

It seems to me that regular expressions are sets, as the ''function'' $\mathcal L$ seems to be taking sets as inputs. I'm making this interpretation based on the notation $\mathcal L(\varnothing) = \varnothing$ in the third definition. My question then is, how should these notations be interpreted? It looks like sometimes sets are replaced by the elements of a set, which to me is very confusing.

We have an exercise where we need to show that $\mathcal L\big( (a_1 + \cdots + a_n)^* \big) = \Sigma^*$, where $a_i \in \Sigma$ for all $i \in \{1,\ldots,n\}$, but the fact that these definitions (or the notations within them) are so convoluted makes it difficult for me to reason about them.

Best Answer

Languages are sets. They're sets of strings.

Regular expressions are not sets. They are a notation for representing sets of strings. A regular expression is a description of a set of strings.

For example, the regular expression ${\bf a }{\bf b}^\ast$ represents the set of strings that begin with an a that might be followed by a string of bs.


The first section you quoted is describing what regular expressions look like. This is called the syntax of regular expressions. The third section explains what regular expressions mean: if you have a regular expression, what set does it represent? (Or “generate”—same thing.)

You're right that $\mathcal L$ is a function. It's the function that takes a regular expression — a notation that represents a set of strings — and tells you which set it represents.

The second section is saying the same as the first, but in a slightly different way. The first section says "here's what a regular expression can look like". The second section just puts it a little differently: "here's what is in the set of regular expressions". But the result is the same: to tell you what regular expressions are like.


Regarding your exercise, the notation here is a little confusing. Let's suppose, for concreteness, that our alphabet, $\Sigma$ includes just the symbols x and y. The exercise is asking about the meaning of the regular expression $$({\bf x} + {\bf y})^\ast$$

It wants you to show that the set represented by this expression (that's $\mathcal L(({{\bf x} + {\bf y})^\ast})$) includes every possible string of xs and ys. The person who wrote the exercise has to find some way to say “every possible string of xs and ys”. (This is called the “Kleene closure of the set $\{{\mathtt x}, {\mathtt y}\}$.) They could have written that phrase, but they didn't. Instead they used an abbreviation. The abbreviation is “$\Sigma^\ast$”. So you're being asked to show that if $\Sigma = \{{\mathtt x}, {\mathtt y}\}$, then $$ \mathcal L(({{\bf x} + {\bf y})^\ast}) = \Sigma^\ast$$ and, more generally, that if $a_1, a_2, \ldots$ are the elements of some alphabet $\Sigma$, then $$ \mathcal L((a_1 + a_2 + \ldots)^\ast) = \Sigma^\ast$$

I hope this helps.

Related Question