[Math] Probability of unique elements in each of ‘S’ multisets sampled with uniform probability

co.combinatoricspr.probability

Assume I have some set $P$ with $||P|| = N$ unique elements. I also have $S$ multisets, $(m_1, …, m_S)$, of cardinality $L$, consisting of elements in $P$ chosen with uniform probability. We call a multiset, $m_i$, 'distinguishable' if it contains at least $k$ elements, though not necessarily distinct elements, that exist in no other multiset.

What is the probability of all $M$ multisets being 'distinguishable' according to this definition?


Edit – I would very very interested in approximate solutions to this question! As in my "bounded ratio of two types of balls question", here $N$ is at least an order of magnitude larger than $S$, $S > 10^3$, and $L$ is relatively small (around $10^1$ to $10^2$ or so).


While sampling $S$ times with replacement from the set $P$, we can state the probability of never choosing the same element twice as:

Prob( $S$ unique selections from $P$ ) = $\prod \frac{(N – i)}{N}$ for $i = 0$ to $(S – 1)$

Or equivalently, we can calculate the probability that the multiset of $S$ sampled elements contains all unique elements as:

Prob( $S$ unique selections from $P$ ) = $\prod ((1-(\frac{1}{N – i}))^{(S – 1 – i)})$ for $i = 0$ to $(S – 1)$


Perhaps we can simplify this problem by restricting $k$ to include only distinct elements, i.e elements that exist only once in all of $(m_1, …, m_S)$ multisets?

Here's what I'm thinking…

We first calculate the probability that one of the $(S*L)$ elements in multisets $(m_1, …, m_S)$, selected from $P$ by sampling with replacement, is selected only once. This should be equivalent to tossing $(S*L)$ balls into $||P|| = N$ bins, and finding the probability that a particular ball is by itself in a bin.

From pg. 95 of "Probability and computing: Randomized algorithms and probabilistic analysis" by Michael Mitzenmacher and Eli Upfal, when we toss $(S*L)$ balls into $N$ bins, the probability that a specific bin has exactly $r$ balls, P[$r$], is given as:

P[$r$] = ${S*L \choose r}$ $(\frac{1}{N})^r(1-\frac{1}{N})^{(S*L-r)}$

By linearity of expectation, we can now write an expression for the expected number of balls that exist in a bin of $r$ balls as: E[X] = $N*r*{S*L \choose r}$ $(\frac{1}{N})^r(1-\frac{1}{N})^{(S*L-r)}$. As balls here correspond to elements in the early problem description, this implies that we can write P[element is unique] as:

P[element is unique] = $\frac{N*(1)*{S*L \choose 1}(\frac{1}{N})^1(1-\frac{1}{N})^{(S*L-1)}}{S*L}$

Returning to the original problem, we have $S$ multisets that we fill with $L$ elements, and we want to calculate the probability that at least $k$ of the elements in each multiset are unique (i.e. in all the multisets, they appear nowhere else). As we now know the probability that a particular element is unique, we can use the binomial formula to find the probability that a particular multiset contains at least $k$ unique elements:

P[at least 'k' elements in a particular multiset, $m_i$, are unique] = 1 – $\sum^{k-1}_{i=0}[ {L \choose i}$ * P[element is unique]$^i$ * (1 – P[element is unique])$^{L-i}$]

By linearity of expectation: $S$ * P[at least 'k' elements in a particular multiset, $m_i$, are unique] ~ # of multisets with at least $k$ unique elements.

To calculate the probability that all multisets contain at least $k$ unique elements, we should be able to write the probability as: P[at least 'k' elements in a particular multiset, $m_i$, are unique]$^S$


These calculations seem to come close to simulated data, but they're still off and I imagine this will prove to be an accident. I'd appreciate any help in finding what are probably obvious flaws? Are there issues with independence, etc?

Best Answer

Edit: Changed the formulas slightly, making it (hopefully) correct, but even less useful.


I think I can write down an expression for the probability, but I'm not sure how useful it is. My interpretation of the random experiment is that a random $(S\times L)$-matrix over $\{1,\ldots,N\}$ is generated by choosing the entries independently and uniform, and the $i$-th multiset is then simply the $i$-th row of this matrix. Clearly, the probability for every single matrix is $1/N^{S\cdot L}$. A matrix is good if every row contains at least $k$ entries that do not appear in any other row, so it remains to count the good matrices. Any good matrix induces a partition $\{1,\ldots,N\}=B_0\cup B_1\cup\cdots\cup B_S$, where $B_i$ for $i=1,\ldots,S$ is the set of elements occuring only in row $i$ (in particular nonempty). The number of good matrices corresponding to a given partition is $$\prod_{i=1}^S\sum_{K=k}^L\binom{L}{K}S(K,|B_i|)\cdot |B_i|!\cdot |B_0|^{L-K},$$ where $S(K,|B_i|)$ is the Stirling number of the second kind, the number of partitions of a $K$-set into exactly $|B_i|$ nonempty subsets.

Now let $\mathcal P$ be the set of ordered partitions $N=b_0+b_1+\cdots+b_S$ of $N$ into nonnegative integers, where $1\leqslant b_i\leqslant L$ for $i=1,\ldots,S$. Then the total number of good matrices can be written as $$A=\sum_{(b_0,b_1,\ldots,b_S)\in\mathcal P}\binom{N}{b_1}\binom{N-b_1}{b_2}\cdots\binom{N-b_1-\cdots-b_{S-1}}{b_S}\prod_{i=1}^S\sum_{K=k}^L\binom{L}{K}S(K,b_i)\cdot b_i!\cdot b_0^{L-K},$$ and your probability is $A/N^{S\cdot L}$.

Related Question