Probability of the cardinality of a set of elements with a given multiplicities from a bag of draws without replacement drawn from a bag of elements

combinatoricsprobability

Definition of the problem

Notations for bags

To define my problem, I will use a bag (a multi-set). An element in a bag can appear multiple times. For a bag $B$ and $e$ an element of $B$, we note the multiplicity of $e \in B$, $m_B(e)$ the number of times $e$ appears in $B$.

Note: the elements in a bag do not have any order.

The problem

Let's consider $B$ a bag of $n$ elements.

Now, we draw $n_d$ elements in $B$ without replacement. We consider $D$ the bag of draws. The draw follows a uniform law, we just pick elements randomly. Also, let's note $m_{\text{max}}= \text{max}\left(\{ m_B(e), e \in B\}\right)$ the maximal multiplicity of $B$.

Also, we have an addionnal condition $m_{\text{max}} < n_d < n$. This condition comes from my original problem (see the end of this post).

Given $M \in \Bbb N^{*}$ and $k \in \Bbb N^{*}$, I search the probability distribution of the cardinal of $\{ m_D(e) = M, e \in D\}$. Namely, I want to know the probability to draw $k$ elements $M$ times when I draw $n_d$ elements without replacement in $B$.

A short example

$$
B = \{ 1, 1, 1, 2, 2, 3, 4, 4, 4, 5, 5 \} \\
m_B(1) = 3, m_B(2) = 2, m_B(3) = 1, m_B(4) = 3, m_B(5) = 2
$$

$B$ has $11$ elements. Now, let's consider we want to draw $n_d = 5$ elements without replacement. There are a lot of different draws, I illustrate a few of them:

$$
D_1 = \{1, 1, 2, 3, 5 \} \\
D_2 = \{1, 2, 2, 4, 4 \} \\
D_3 = \{1, 1, 2, 5, 5 \} \\
$$

Now, I will illustrate my question with some pairs of $(k, M)$ values:

  • $k = 2, M = 1$: what is the probability to draw exactly 2 elements 1 time each?
  • $k = 1, M = 3$: what is the probability to draw exactly 1 element 3 times?
  • $k = 2, M = 2$: what is the probability to draw exactly 2 elements 2 times each?

In the general case, given $B$ a bag of $n$ elements, we perform a random draw without replacement $D$ of $n_d$ elements, what is the probability to find exactly $k$ elements of multiplicity $M$ in $D$?

What I tried

I try calculate the probability to draw an element $m_D(e)$ in $n_d$ draw (without replacement). This follows a hypergeometric law.

Namely, let's consider an element $e$ which has been drawn $m_D(e)$ times. We also note its original multiplicity in $B$ as $m_B(e)$. We perform the following counting:

  • We must choose $m_D(e)$ elements among $m_B(e)$. So we have $\binom{ m_B(e) }{ m_D(e) }$ possibilities.
  • Then, we count the rest of the draw. We need to pick $n_d – m_D(e)$ elements. We can pick them in $B$ as long as they are not $e$. Then, there are $ n – m_B(e) $ elements which can be picked. So, we have $ \binom{ n – m_B(e) }{ n_d – m_D(e) }$.
  • Finally, the overall possibilities to draw $n_d$ elements in a bag of $n$ elements is $\binom{ n }{ n_d }$.

Thus, we have the probability to have an element $e$ with multiplicity $m_D(e)$:

$$
\frac{\binom{ m_B(e) }{ m_D(e) } \cdot \binom{ n – m_B{e} }{ n_d – m_D(e) }}{\binom{ n }{ n_d }}
$$

Although, I cannot find a way to link this probability to what I want to search…

Origin of the question

I am doing some research in Machine Learning. I have a bag of $n$ samples as a dataset. Each sample has a class. So the multiplicity of an element is the number of samples per class. I randomly – uniformly – takes $m$ samples from the dataset without replacement. I would like to know, based on the numbers of samples in each class, what are the odds of having $k$ elements with multiplicity $M$, for any $M$.

Best Answer

Suppose that $B$ has $\ell$ distinct elements (i.e., its underlying set has size $\ell$). Without loss of generality, let's call them $1, 2, \ldots, \ell$, and let me simplify slightly the notation and write $m_i$ for $m_B(i)$ when $1 \leq i \leq \ell$. Consider a set $S \subseteq \{1, \ldots, \ell\}$, and let's ask the question: what is the probability that each element of $S$ is drawn exactly $M$ times when we draw $n_d$ elements from $B$ without replacement? I expect that you can convince yourself (along the lines of the argument you gave in the question) that the answer is $$ p_S := \frac{1}{\binom{n}{n_d}} \cdot \prod_{s \in S} \binom{m_s}{M} \cdot \binom{n - \sum_{s \in S} m_s}{n_d - M \cdot |S|}. $$ Now we could think about this when $S$ is a set whose size is exactly $|S| = M$, but actually $p_S$ is not what we want because it includes situations in which elements not in $S$ coincidentally also are chosen exactly $M$ times. Thus, we have to use inclusion-exclusion: for $S \subset \{1, \ldots, \ell\}$, let $q_S$ be the probability that each element of $S$ is drawn exactly $M$ times and also no elements not in $S$ are drawn exactly $M$ times. By definition, we have $$ p_S = \sum_{T : S \subseteq T \subseteq \{1, \ldots, \ell\}} q_T. $$ It follows by inclusion-exclusion that $$ q_S = \sum_{T : S \subseteq T \subseteq \{1, \ldots, \ell\}} (-1)^{|T| - |S|} p_T. $$ Combining this with the previous formula and summing over all sets $S$ of size $k$ gives the answer.

Related Question