[Math] “Nyldon words”: understanding a class of words factorizing the free monoid increasingly

algebraic-combinatoricsco.combinatoricscombinatorics-on-wordslyndon-words

BACKGROUND.

Let me first introduce some classical definitions, which appear, e.g., in §5 of Lothaire's Combinatorics on Words, in §5.1 of Reutenauer's Free Lie algebras, and in §6.1 of Victor Reiner's and my Hopf algebras in Combinatorics. If you are not a stranger to combinatorics on words, scroll right down to the question.

Words. Let $\mathfrak A$ be a finite totally ordered set. We call $\mathfrak A$ the alphabet, and refer to the elements of $\mathfrak A$ as letters. Let $\mathfrak A^\ast$ be the disjoint union of the sets $\mathfrak A^n$ over all $n \in \mathbb{N} = \left\{0, 1, 2, \ldots\right\}$. The elements of $\mathfrak A^\ast$ are called words (over the alphabet $\mathfrak A$). The concatenation of two words $u = \left(u_1, u_2, \ldots, u_n\right) \in \mathfrak A^\ast$ and $v = \left(v_1, v_2, \ldots, v_m\right) \in \mathfrak A^\ast$ is defined to be the word $\left(u_1, u_2, \ldots, u_n, v_1, v_2, \ldots, v_m\right)$; it is denoted by $uv$. The set $\mathfrak A^\ast$, equipped with concatenation as a binary operation, becomes a monoid (the neutral element is the empty word, denoted by $\varnothing$); this is the free monoid on the set $\mathfrak A$. A word $u$ is said to be a prefix of a word $w$ if and only if there exists a word $v$ (possibly empty) satisfying $w = uv$.

Lexicographic order. We define a relation $\leq$ on $\mathfrak A^\ast$ by letting two words $u = \left(u_1, u_2, \ldots, u_n\right) \in \mathfrak A^\ast$ and $v = \left(v_1, v_2, \ldots, v_m\right) \in \mathfrak A^\ast$ satisfy $u \leq v$ if and only if either there exists a positive integer $i \leq \min\left\{n,m\right\}$ such that $u_i < v_i$ and ($u_j = v_j$ for every $j < i$), or the word $u$ is a prefix of $v$. This relation $\leq$ is (the smaller-or-equal relation of) a total order on $\mathfrak A^\ast$; this order is the so-called lexicographic order (or dictionary order) on $\mathfrak A^\ast$. It satisfies many of the properties one would expect from an order on words, but one must keep in mind that while $u \leq v$ always implies $wu \leq wv$ for words $u,v,w$, it does not always imply $uw \leq vw$.

Lyndon words. A Lyndon word is defined to be a nonempty word $w \in \mathfrak A^\ast$ satisfying one of the following three equivalent assertions:

  1. If $u$ and $v$ are any two nonempty words such that $w = uv$, then $v > w$.

  2. If $u$ and $v$ are any two nonempty words such that $w = uv$, then $v > u$.

  3. If $u$ and $v$ are any two nonempty words such that $w = uv$, then $vu > uv$.

Notice that the equivalence of these three assertions is not trivial; if we would restrict the assertions to a particular choice of $u$ and $v$, then they would cease to be equivalent.

For example, if $\mathfrak A = \mathbb N$ with the usual total order, then the words $4$, $223$, $04044$ and $0303304$ are Lyndon, while the words $\varnothing$, $10$, $11$ and $242$ are not.

Chen-Fox-Lyndon factorization. Lyndon words have a multitude of curious properties as well as applications to algebra (free groups, free Lie algebras, shuffle algebras, quasisymmetric functions etc.). Probably the most important result that is "behind" these applications is the Chen-Fox-Lyndon factorization theorem, which states that for every word $w \in \mathfrak A^\ast$, there exists a unique tuple $\left(a_1, a_2, \ldots, a_k\right)$ of Lyndon words satisfying $w = a_1 a_2 \cdots a_k$ and $a_1 \geq a_2 \geq \cdots \geq a_k$.

For instance, if $\mathfrak A = \mathbb N$ with the usual total order, for $w = 104029423402010110101$, this tuple is $\left(1, 04, 0294234, 02, 01011, 01, 01\right)$.

This factorization can be used as an alternative way to define Lyndon words. Indeed, assume that we are given some positive integer $N$, and we have defined the notion of Lyndon words of length $< N$. Then, we can define the notion of a Lyndon word of length $N$ as being a word of length $N$ for which there exists no tuple $\left(a_1, a_2, \ldots, a_k\right)$ of Lyndon words of length $< N$ satisfying $w = a_1 a_2 \cdots a_k$ and $a_1 \geq a_2 \geq \cdots \geq a_k$. This definition is implicit and impractical, and it does not help proving the uniqueness of the factorization, but I would not be surprised if it is how people actually discovered Lyndon words.

Necklaces. A necklace (also known as a circular word) means an equivalence class of words modulo cyclic rotation. A necklace of nonempty words is said to be aperiodic if the number of its representatives equals the length of each of its representatives. For instance, the necklaces $\overline{1213}$ and $\overline{24424}$ are aperiodic, while $\overline{1212}$ and $\overline{404040}$ are not. Here, $\overline{w}$ denotes the necklace containing a word $w$; so we have $\overline{1213} = \overline{2131} = \overline{1312} = \overline{3121} = \left\{ 1213, 2131, 1312, 3121 \right\}$. A known fact states that every aperiodic necklace contains precisely one representative which is a Lyndon word, whereas any non-aperiodic necklace contains no such representative. Thus, if the alphabet $\mathfrak A$ has finite cardinality $q$, then the number of Lyndon words of length $n$ (for fixed positive integer $n$) equals the number of aperiodic necklaces of $n$-letter words. This number is $\dfrac{1}{n} \sum\limits_{d\mid n} \mu\left(n/d\right) q^d$.

QUESTION.

Nyldon words. We can define the notion of Nyldon words by recalling our alternative definition of Lyndon words using Chen-Fox-Lyndon factorization, and replacing the condition $a_1 \geq a_2 \geq \cdots \geq a_k$ by $a_1 \leq a_2 \leq \cdots \leq a_k$. In more detail:

  • There are no Nyldon words of length $< 1$.

  • Assume that we are given some positive integer $N$, and we have defined the notion of Nyldon words of length $< N$. Then, we define the notion of a Nyldon word of length $N$ as being a word of length $N$ for which there exists no tuple $\left(a_1, a_2, \ldots, a_k\right)$ of Nyldon words of length $< N$ satisfying $w = a_1 a_2 \cdots a_k$ and $a_1 \leq a_2 \leq \cdots \leq a_k$.

Conjecture 1. For every word $w$, there exists a unique tuple $\left(a_1, a_2, \ldots, a_k\right)$ of Nyldon words satisfying $w = a_1 a_2 \cdots a_k$ and $a_1 \leq a_2 \leq \cdots \leq a_k$.

The existence part of this is obvious, but the uniqueness is interesting. I have checked it by computer for $\mathfrak A = \left\{0, 1\right\}$ and word length up to $18$.

Conjecture 2. Every aperiodic necklace contains precisely one representative which is a Nyldon word, whereas any non-aperiodic necklace contains no such representative.

This is checked for $\mathfrak A = \left\{0, 1\right\}$ and word length up to $15$.

Each of the Conjectures 1 and 2 would imply that the number of Nyldon words of given length over a given alphabet equals the number of Lyndon words of the same data. This is, in fact, equivalent to Conjecture 1 (because one can WLOG assume $\mathfrak A$ finite, and then argue using the fact that surjections between finite sets of equal size must be injections), so Conjecture 2 can be regarded as a stronger version of Conjecture 1.

EXPERIMENTAL DATA.

For $\mathfrak A = \left\{0, 1\right\}$, here are the Nyldon words of lengths up to $6$:

length $0$: none.

length $1$: the words $0$ and $1$. (For comparison: The Lyndon words are $0$ and $1$.)

length $2$: the word $10$. (For comparison: The Lyndon words are $01$.)

length $3$: the words $100$ and $101$. (For comparison: The Lyndon words are $001$ and $011$.)

length $4$: the words $1000$, $1001$ and $1011$. (For comparison: The Lyndon words are $0001$, $0011$ and $0111$.)

length $5$: the words $10000$, $10001$, $10010$, $10011$, $10110$ and $10111$. (For comparison: The Lyndon words are $00001$, $00011$, $00101$, $00111$, $01011$ and $01111$.)

length $6$: the words $100000$, $100001$, $100010$, $100011$, $100110$, $100111$, $101100$, $101110$ and $101111$. (For comparison: The Lyndon words are $000001$, $000011$, $000101$, $000111$, $001011$, $001101$, $001111$, $010111$ and $011111$.)

Best Answer

My co-authors (Émilie Charlier, Manon Philibert) and I give positive answers to Grinberg's conjectures in the paper E. Charlier, M. Philibert, M. Stipulanti, Nyldon words. So it is true that there are equally many Nyldon words and Lyndon words of a given length. In addition, we show that the set of Nyldon words is a Hall set (more precisely, we prove that it is a Lazard factorization, but the notions are equivalent). In this paper, we also compare Lyndon and Nyldon words regarding different properties such as standard and Širšov factorizations, circular and comma-free codes, and Lazard factorizations. We also leave in the paper open questions for future work. I hope you will enjoy it!