How many words of length $k$ are there such that no symbol in the alphabet $\Sigma$ occur exactly once

automatacombinatoricscombinatorics-on-wordsinclusion-exclusion

Introduction

Given an alphabet $\Sigma$ of size $s$, I want to find a way of counting words $w$ of length $k$ which obey the rule: No symbol occurs exactly once in $w$.
We'll call this number $Q^s_k$.

I am particularly interested in closed-form expressions, for $Q_k^s$ or at least expressions that are fairly easy to calculate when the number of symbols is moderately large (say $s \sim 50$).

I'm not particularly up to speed in this area of maths, but I've tried a couple of different things. I'll list them below, and end with my questions on how to move on.

Deterministic Finite Automata

The language I've described above is regular, so it's possible to construct a discrete finite automaton describing it. Here's what that looks like for an alphabet of two symbols
DFA

The blue and green arrows correspond to inputs of the two different types of symbols. The accepting states of the DFA are 02, 20 and 22.

The number of accepted words of length $k$ is then the number of paths of length $k$ from the initial state to an accepting state. From this cs stackexchange question I've found that once you have the transition matrix of the DFA, the problem boils down to calculating powers of the transition matrix, and then looking at particular rows and columns. Unfortunately, there are $3^s$ states in the DFA (for each symbol, we can have encountered it 0, 1 or more than 1 times), and $s\times 3^s$ nonzero transitions between them.

Combinatorial, approach

My second approach to this problem was to try and find a recursive expression for $Q^s_k$.

If we restrict the alphabet to one symbol, i.e. $s = 1$, then if $k \neq 1$, there's exactly one valid word, otherwise there are none. Note that by definition the empty word is valid.

If we extend the alphabet to two words, we can write the new expression in terms of $Q^1$, giving $Q^2_k = \sum_{m=0; m\neq1}^k Q^1_{k-m} C(k, m)$. The idea is that we can form valid words with two symbols by taking $m$ instances of the new symbol and inserting into the valid words of length $k – m$ with one symbol. We don't use $m = 1$ in the sum since that would not give a valid word, and the case $k -m$ = 1 is taken care of by the fact that $Q^1_1 = 0$. The combinatorial factor accounts for the fact that we have to select $m$ slots in the final string for the occurrences of the new symbol, and all permutations of those slots are equivalent.

In fact, this approach generalises neatly to the recursive expression

\begin{equation} \tag{*}\label{eq:combinatorics}
Q^{s+1}_k = \sum_{m=0; m\neq1}^k Q^{s}_{k-m} C(k, m),
\end{equation}

since the logic for what happens when we add a new symbol to an existing alphabet is exactly the same.

Unfortunately, that's about where I got stuck.

Inclusion exclusion

As an alternative to trying to count all the valid words, we could take the perspective that the number of valid words is the total number of words minus the number of invalid words.

For example, the number of valid words of length $k$ with an alphabet of two words is $Q^2_k = 2^k – 2k + 2\delta_{k,2}$. The first term on the RHS is the total number of words of length $k$. The next term accounts for the fact that all words with one occurrence of one symbol and $k-1$ of the other are invalid; there are two symbols and $k$ slots to fit them into, giving $2k$. The last term accounts for the double counting that occurs when we exclude words where both the first and the second symbol occur exactly once.

In general we start with the $s^k$ possible words, and subtract all words where a given symbol only occurs once. For each symbol there are $k (s-1)^{k-1}$ such words, since there are $k$ slots for the symbol of interest, and the remaining slots must be filled from an alphabet consisting of all the other symbols. This gives $s\cdot k\cdot (s-1)^{k-1}$ words where a symbol occurs only once. However, this double counts all the cases where two symbols occur only once. There are $C(s, 2)$ pairs of symbols in the alphabet, and there are $P(k, 2)$ possible permutations of these symbols in a word of length $k$. The remaining $(k – 2)$ symbols are chosen from an alphabet of length $s – 2$, giving $(s-2)^{k-2}C(s, 2) P(s, 2)$ words of this form. Continuing, we arrive at the general expression

\begin{equation} \tag{**}\label{eq:inclusion}
Q^{s}_k = \sum_{i=0}^{\mathrm{min}(s, k)} (-1)^{i}(s – i)^{k-i} C(s, i) P(k, i)
\end{equation}

Note that for this expression to be true, we must consider $0^0 = 1$. Again, I was unable to simplify this much further.

Questions

  • Constructing the giant DFA feels like an unsatisfactory approach, because it completely ignores the symmetries in the problem. If the DFA is in the initial state "$000\ldots$", then any symbol it encounters is in some sense equivalent. Is there some clever way of using the symmetries of the problem to reduce the size of the DFA?
  • Can either (or both) of the expressions $\eqref{eq:combinatorics}$ and $\eqref{eq:inclusion}$ be simplified further?
  • Is there a nice algebraic argument for why $\eqref{eq:combinatorics}$ and $\eqref{eq:inclusion}$ are equal? I know they must be, since they count the same thing, but I don't see any simple way of showing it

Best Answer

Using combinatorial classes we get for this problem the following class:

$$\def\textsc#1{\dosc#1\csod} \def\dosc#1#2\csod{{\rm #1{\small #2}}} \textsc{SEQ}_{=s}(\textsc{SET}_{\ne 1}(\mathcal{Z}))$$

which gives the EGF

$$F(z) = (\exp(z)-z)^s.$$

The desired quantity is thus given by

$$k! [z^k] F(z) = k! [z^k] (\exp(z)-z)^s \\ = k! [z^k] \sum_{r=0}^s {s\choose r} (-1)^r z^r \exp((s-r)z) \\ = k! \sum_{r=0}^{\min(s,k)} {s\choose r} (-1)^r [z^{k-r}] \exp((s-r)z) \\ = k! \sum_{r=0}^{\min(s,k)} {s\choose r} (-1)^r \frac{1}{(k-r)!} (s-r)^{k-r}.$$

Re-arranging we find

$$\bbox[5px,border:2px solid #00A000]{ \sum_{r=0}^{\min(s,k)} {s\choose r} (-1)^r \frac{k!}{(k-r)!} (s-r)^{k-r}}$$

which is the formula obtained by the companion posters.

Here we are distributing $k$ distinguishable labelled balls (say labeled with $1$ to $k$) into a row of $s$ distinguishable slots or boxes (say labeled with $1$ to $s$). If a ball $q$ lands in slot $p$ that means position $q$ of the string is assigned the letter $p.$ No slot is allowed to hold just one ball.