Coupon Collector’s Earworm – A Combinatorial Analysis

asymptotics · co.combinatorics · reference-request

[EDITED mostly to report on the answer by Kevin Costello
(and to improve the gp code at the end)]

I thank Nicolas Dupont for the following question
(and for permission to disseminate it further):

I have a playlist with, say, $N$ pieces of music.
While using the shuffle option (a piece is chosen uniformly at random
at each step), I realized that, generally speaking, I hear the same
piece quite a lot of times before the last one appears.
It makes me think of the following question:

At the moment the last not-yet-heard piece is played, what is,
on average, the maximum number of times any single piece has already been played?

I have not previously encountered this variant of the coupon
collector's problem. If it is new, then (thinking of the problem's
original real-world context) I propose to call it the "coupon
collector's earworm".

Is this in fact a new question?
If not, what is known about it already?

Let us call this expected value $M_N$. Nicolas observes that $M_1=1$
and $M_2=2$ (see below), and asks:

I doubt there is such an easy formula for general $N$ –
would you be able to find some information on it,
e.g. its asymptotic behavior?

[Nicely answered by Kevin Costello:
$M_N$ is asymptotic to $e \log N$ as $N \rightarrow \infty$.
Moreover, the maximal multiplicity is within $o(\log N)$ of $e \log N$
with probability $\rightarrow 1$. I don't recall any other instance
of a naturally-arising asymptotic growth rate of $e \log N$…]

Recall that the standard coupon collector's problem asks for the expected value
of the total number of shuffles until each piece has appeared at least once.
It is well known that the answer is $N H_N$, where $H_N = \sum_{i=1}^N 1/i$
is the $N$-th harmonic number. Hence the expected value of the average
number of plays per track is $H_N$, which grows as $\log N + O(1)$.
The expected maximum $M_N$ must be at least as large; in fact it is
larger once $N>1$, because one track is heard only once, so the average
over the other tracks is $(N H_N-1) / (N-1)$. One might guess that $M_N$
is not that much larger, because typically it's only the last few tracks
that account for most of the shuffles. But it doesn't seem that easy to get
at the difference between the expected maximum and the expected average,
even asymptotically. Unlike the standard coupon collector's problem,
here an exact answer seems hopeless once $N$ is at all large (see below),
so I ask:

How does the difference $M_N - H_N$ behave
asymptotically as $N \rightarrow \infty$?

[Looks like this was a red herring: one might plausibly guess that
$M_N - H_N$ is dwarfed by $H_N$ for large $N$, but by Kevin Costello's
answer $M_N - H_N$ is asymptotic to $(e-1) H_N$,
and the factor $e-1$ is no simpler than
$\lim_{N\rightarrow\infty} M_N/H_N = e$,
so analyzing the difference $M_N-H_N$ is likely not a fruitful approach.]
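
For a concrete sense of these quantities, here is a quick GP tabulation
(my own sketch, not from the original post; H and avgrest are ad-hoc names)
of the per-track average $H_N$, the average $(N H_N - 1)/(N-1)$ over the
tracks other than the last new one, and the asymptote $e \log N$ from
Kevin Costello's answer:

H(N) = sum(i=1, N, 1./i)  \\ harmonic number, in floating point
avgrest(N) = (N*H(N) - 1) / (N-1)  \\ average plays over the other N-1 tracks
for(k=1, 4, my(N=10^k); print([N, H(N), avgrest(N), exp(1)*log(N)]))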

Here are the few other things I know about this:

@ For each $N>1$, the expected value of the maximum count
is given by the convergent $(N-1)$-fold sum
$$
M_N = \sum_{a_1,\ldots,a_{N-1} \geq 1}
N^{-\!\sum_i \! a_i} \Bigl(\sum_i a_i\Bigr)! \frac{\max_i a_i}{\prod_i a_i!}.
$$
Indeed, we may assume without loss of generality that the $N$-th track
is heard last; conditional on this assumption, the probability that
the $i$-th track will be heard $a_i$ times for each $i<N$ is
$N^{-\!\sum_i \! a_i}$ times the multinomial coefficient
$(\sum_i a_i)! \big/ \prod_i a_i!$. Numerically, these values are
$$
2.00000, \quad
2.84164+, \quad
3.49914-, \quad
4.02595\!-
$$
for $N=2,3,4,5$.
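
These decimals can be confirmed by truncating the sum, since the summands
decay geometrically; here is a hedged GP sketch (the truncation bounds are
ad-hoc choices that leave a negligible tail):

sum(a=1,60, sum(b=1,60, 1.*binomial(a+b,a)*max(a,b)/3^(a+b)))  \\ N=3: about 2.84164
sum(a=1,40, sum(b=1,40, sum(c=1,40, 1.*(a+b+c)!/(a!*b!*c!)*vecmax([a,b,c])/4^(a+b+c))))  \\ N=4: about 3.49914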

@ A closed form for $M_N$ is available for $N \leq 3$ and probably
not beyond. Trivially $M_1 = 1$, and N. Dupont already obtained
the value $M_2 = 2$ by evaluating $M_2 = \sum_{a \geq 1} a/2^a$;
indeed for $N=1$ and $N=2$ the problem reduces to the classical
coupon collector's problem. Already for $N=3$ we have a surprise:
$M_3 = 3/2 + (3/\sqrt{5})$, which has an elementary but somewhat tricky proof.
For $N=4$, I get
$$
M_4 = \frac73 - \sqrt{3} + \frac4\pi \int_{x_0}^\infty
\frac{(2x+1) \, dx}{(x-1) \sqrt{4x^3-(4x-1)^2}}
$$
where $x_0 = 3.43968\!+$ is the largest of the roots (all real) of
the cubic $4x^3-(4x-1)^2$. I don't expect this to simplify further:
the integral is the period over an elliptic curve of a differential with two
simple poles that do not differ by a torsion point. In general
one can reduce the $(N-1)$-fold sum to an $(N-2)$-fold one
(which is one route to the value of $M_3$ and $M_4$), or to an
$(N-3)$-fold integral, but probably not beyond.
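
(One elementary route to $M_3$, reconstructed here rather than quoted from
the original: write $\max(a,b) = \frac{a+b}{2} + \frac{|a-b|}{2}$; the first
term contributes half of $E[a+b] = 3H_3 - 1 = 9/2$, and the second reduces,
via $\sum_{a=0}^n \binom{n}{a}\,|2a-n| = 2n\binom{n-1}{\lfloor (n-1)/2 \rfloor}$,
to evaluating a derivative of the generating function
$\sum_{m \ge 0} \binom{m}{\lfloor m/2 \rfloor} x^m
= \bigl(\sqrt{(1+2x)/(1-2x)} - 1\bigr)/(2x)$ at $x = 1/3$, which is where
the $\sqrt{5}$ enters.) Both closed forms are easy to check numerically in GP;
in the following sketch the endpoint flags passed to intnum are my own guesses
at the $(x-x_0)^{-1/2}$ singularity and the $x^{-3/2}$ decay at infinity:

1.5 + 3/sqrt(5)  \\ M_3: about 2.84164
x0 = vecmax(real(polroots(4*x^3 - (4*x-1)^2)));  \\ largest real root, 3.43968...
7/3 - sqrt(3) + (4/Pi)*intnum(t = [x0,-1/2], [oo,-3/2], (2*t+1)/((t-1)*sqrt(4*t^3-(4*t-1)^2)))  \\ M_4: about 3.49914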

@ It's not too hard to simulate this process even for considerably larger $N$.
In GP one can get a single sample of the distribution from the code

try(N, v) = v=vector(N); while(!vecmin(v), v[random(N)+1]++); vecmax(v)

[turns out that one doesn't need to call vecmin each turn:

try(N, v,m,i)= v=vector(N); m=N; while(m, i=random(N)+1; v[i]++; if(v[i]==1,m--)); vecmax(v)

does the same thing in $\rho+O(1)$ operations per shuffle rather than
$\rho+O(N)$, where $\rho$ is the cost of one
random(N) call.]

So for example

sum(n=1,10^4,try(100)) / 10000.

averages $10^4$ samples for $M_{100}$; this takes a few seconds, and
seems to give about $11.7$.
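
To attach an error bar to such estimates, one can also track the second
moment; a hedged sketch in the same style (estM is an ad-hoc name):

estM(N, T, s,q,x) = s=0.; q=0.; for(n=1, T, x=try(N); s+=x; q+=x^2); [s/T, sqrt((q/T - (s/T)^2)/T)]

Then estM(100, 10^4) returns the sample mean together with its standard error.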

Best Answer

For the asymptotics: let $t_1=n \log n - Cn$ and $t_2 = n \log n + Cn$, where $C = C(n)$ tends to infinity slowly. It is a classical result that the probability that all coupons have been collected by time $t_1$ tends to $0$, while the probability that all coupons have been collected by time $t_2$ tends to $1$.

So the number asked for in the original post is with high probability sandwiched between the count of the most-collected coupon at time $t_1$ and at time $t_2$. This has been studied a fair amount, though often in the language of load balancing and balls in bins ("if you toss $m$ balls into $n$ bins, what is the typical number of balls in the fullest bin?"). In particular, there is a nice analysis by Raab and Steger that uses the second moment method to give tight concentration. It follows from Theorem $1$ of their paper that at time $c n \log n$, the most common coupon has almost surely been collected $(d_c+o(1)) \log n$ times, where $d_c$ is the larger real root of $$1+x(\log c - \log x +1)-c=0.$$

In our case, we have $d_1=e$, so the most common song will have been heard $(e+o(1))\log n = (e+o(1))H_n$ times.
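
That root is easy to extract numerically; a hedged GP sketch (the bracketing interval is an ad-hoc choice: the left-hand side equals $1 > 0$ at $x=c$ and is negative for large $x$):

dc(c) = solve(x = c, 3*exp(1)*c + 10, 1 + x*(log(c) - log(x) + 1) - c)
dc(1)  \\ about 2.71828, recovering d_1 = e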


In response to the comment: the first moment part of the calculation also comes with some concentration estimates for free. At time $t_1$, the probability that we've seen some coupon at least $C \log n$ times can be bounded by \begin{eqnarray*} & & \sum_{j=C \log n}^{t_1} n \binom{t_1}{j} n^{-j} \left(1-\frac{1}{n}\right)^{t_1-j} \\ &\leq& \sum_{j=C \log n}^{t_1} n \left(\frac{e t_1}{nj}\right)^j \left(1-\frac{1}{n}\right)^{n \log n - Cn} \left(1-\frac{1}{n}\right)^{-j} \\ &=& n^{o(1)} \sum_{j=C \log n}^{t_1} \left(\frac{e \log n(1+o(1))}{j}\right)^j \\ &\leq& n^{o(1)} \sum_{j=C \log n}^{t_1} \left(\frac{e}{C}+o(1)\right)^j, \end{eqnarray*} which should decay rapidly enough in $C$ to guarantee that the mean lies close to the concentration point.

The linked bounds for the probability of the completion time lying outside $[t_1, t_2]$ also decay quite rapidly (e.g. cardinal's answer, and David's subsequent comments on it).
