Guessing number of colors of beads in an urn

combinatoricsmultinomial-distributionprobabilityrecreational-mathematicsstatistics

Motivation from cocktail bar

Every time when I order the cocktail “Latex and Prejudice” (“Латекс и предубеждение”) in the Tesla bar in Saint Petersburg (Russia) the barkeeper selects by random a small interesting photo$^1$ and attaches it with a clamp to the cocktail glass. At the beginning I got always different pictures and started to collect them. The more cocktails I ordered the more often the motifs repeated. Finally, I had drunken so much that almost every time I had to ask for a different photo. I was wondering how many different pictures there are but due to drunkenness couldn't solve the problem by myself.

Mathematical form using urn model

Given is an urn with an unknown number of balls that have an unknown number of colors. It is assumed that every color has equal probability. From this urn in total $n$ balls with $m$ different colors were drawn, $k_i$ balls for color $i$ were sampled, i.e. $n=\sum_{i=1}^m k_i$. How many different colors $M$ are in the urn? Sampling with and without replacement is of interest.

Open questions

Some answers were already given. Now I am looking for either alternative answers or/and answers to the following more specific questions:

  1. Let's only for the first question assume that we do not know if the balls were drawn with or without replacement. Is the maximum likelihood for drawing with replacement always higher than the maximum likelihood for drawing without replacement?

  2. Is this answer helpful? If yes: Can we consider the calculated likelihoods as discrete probability distributions if they were be normalized (with support on integers in the range $(m,\infty)$)? What can we say about variance, standard error?

  3. What can we say about variance, standard error of another answer?

Related problems

In this SE post the number of colors in the urn is also unknown but in the problem given here it is assumed that the colors have equal probability. Another SE post deals with lending books from a library that were already lent at an earlier time.

Annotation 2023

The bar was visited a lot of times before the war.


$\small{^1 \text{Because the site is accessible to minors,$\\$ the content of the photos is not discussed here.}}$

Best Answer

Drawing without replacement

Let $n$ be the number of balls that were sampled with $m$ distinct colors and color frequencies $k_i$ for $i=(1,\ldots, m)$. We call $M$ the number of colors in the urn and $K$ the number of balls per color. $M$ and $K$ are not known, except of the facts that $M\ge m$ and $K\ge \max(k_i)$. Also we have assumed (concerning the original post) that every color has same frequency in the urn. The likelihood to get the observed color counts $k_i$, if we assume values for $M$ and $K$, is given by the multivariate hypergeometric distribution $$\frac{\prod_{i=1}^{m}\binom{K}{k_i}}{\binom{K\cdot M}{n}}.\tag{1}$$

As the order of the color frequencies in $k_i$ doesn't matter for our problem we have to multiply eq.$(1)$ by the number of their multiset permutations $$\frac{M!}{(M-m)!\prod_{j=1}^lz_j!},\tag{2}$$ where $z_j$ are the multiplicities of the elements in $k$ and $l$ is the number of different elements in $k$. The factor $(M-m)!$ in eq.$(2)$ includes the multiplicity of $0$, i.e. all colors that are contained in the urn but that were not drawn. (Note that $k_i \ge 1$, because we count only the colors that were drawn but we do not count the colors that were not drawn.) Multiplication of eq.$(1)$ and eq.$(2)$ gives the likelihood for observing $k_i$ colors assuming values for $M$ and $K$ $$p_1(M,K)=\frac{\prod_{i=1}^{m}\binom{K}{k_i}}{\binom{K\cdot M}{n}}\frac{M!}{(M-m)!\prod_{j=1}^lz_j!}\tag{3}$$

Answer: The best guess for the number of colors in the urn is given by that value of $M$ that maximizes $p_1$ for an assumed $K$.

Drawing with replacement

This corresponds to the case $K\to \infty$ in eq.$(3)$ and the likelihood assuming $M$ converges to a multinominal distribution that as previously is multiplicated by eq.(2) $$p_2(M)=\frac{n!}{\prod_{i=1}^m k_i!}\left(\frac{1}{M}\right)^n \frac{M!}{(M-m)!\prod_{j=1}^lz_j!}\tag{4}.$$ Answer: The best guess for the number of colors in the urn is given by that value of $M$ that maximizes $p_2$.

Example

Let's assume the frequencies of the colors of the drawn balls to be $k_i=(1, 2, 1, 1, 1, 2, 1, 1, 3, 1, 1)$. It follows that $n=15, m=11, K\ge 3, M\ge 11$. The multiplicities of the frequencies in $k$ are $z_j=(1,2,8)$ for $l=3$ different elements in $k$. Plotting the likelihoods shows that there are most likely $21$ different colors in the urn for drawing with replacement ($K=\infty$, black curve). For drawing without replacement and assuming that there are not more balls per color than given by the example ($K=3$, red curve), we get $16$ different colors in the urn as most likely variant. Assuming that in the urn were $K=6$ balls per color we get $18$ colors (blue curve). For higher $K$ the black curve is approached.

Related Question