Probability – Rolling a 6-sided Die to Obtain Every Number

coupon-collector-problem dice probability

I've just played a game with my kids that basically boils down to: whoever rolls every number at least once on a 6-sided die wins.

I won, eventually, and the others finished 1-2 turns later. Now I'm wondering: what is the expectation of the length of the game?

I know that the expectation of the number of rolls till you hit a specific number is
$\sum_{n=1}^\infty n\frac{1}{6}(\frac{5}{6})^{n-1}=6$.

However, I have two questions:

How many times do you have to roll a six-sided die until you get every number at least once?
Among four independent trials (i.e. with four players), what is the expectation of the maximum number of rolls needed? [note: it's maximum, not minimum, because at their age, it's more about finishing than about getting there first for my kids]

I can simulate the result, but I wonder how I would go about calculating it analytically.

Here's a Monte Carlo simulation in Matlab

mx=zeros(1000000,1);
for i=1:1000000,
   %# assume it's never going to take us >100 rolls
   r=randi(6,100,1);
   %# since R2013a, unique returns the first occurrence
   %# for earlier versions, take the minimum of x
   %# and subtract it from the total array length
   [~,x]=unique(r); 
   %# make sure we haven't violated the 100-roll assumption in this trial
   assert(numel(x)==6)
   mx(i,1)=max(x);
end

%# find the expected value for the coupon collector problem
expectationForOneRun = mean(mx)

%# find the expected number of rolls as a maximum of four independent players
maxExpectationForFourRuns = mean( max( reshape( mx, 4, []), [], 1) )

expectationForOneRun =
   14.7014 (SEM 0.006)

maxExpectationForFourRuns =
   21.4815 (SEM 0.01)

Best Answer

Because a "completely analytical approach" has been requested, here is an exact solution. It also provides an alternative approach to solving the question at Probability to draw a black ball in a set of black and white balls with mixed replacement conditions.

The number of moves in the game, $X$, can be modeled as the sum of six independent realizations of Geometric$(p)$ variables with probabilities $p=1, 5/6, 4/6, 3/6, 2/6, 1/6$, each of them shifted by $1$ (because a geometric variable counts only the rolls preceding a success and we must also count the rolls on which successes were observed). By computing with the geometric distribution, we will therefore obtain answers that are $6$ less than the desired ones and therefore must be sure to add $6$ back at the end.

The probability generating function (pgf) of such a geometric variable with parameter $p$ is

$$f(z, p) = \frac{p}{1-(1-p)z}.$$

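Therefore the pgf for the sum of these six variables is

$$g(z) = \prod_{i=1}^6 f(z, i/6).$$

Separating this product into five terms via partial fractions and expanding each term as a geometric series gives the coefficients of $g$, and hence the probabilities, in closed form. Reusing $z$ as an integer argument (the value at $z \ge 1$ is the coefficient of the power $z-1$, that is, $\Pr(X = z-1)$), they are

$$6^{-z-4} \left(-5\ 2^{z+5}+10\ 3^{z+4}-5\ 4^{z+4}+5^{z+4}+5\right).$$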

The cumulative distribution function (CDF) is obtained from the partial sums of $g$ (as a power series in $z$), which amounts to summing geometric series, and is given by

$$F(z) = 6^{-z-4} \left(-(1)\ 1^{z+4} + (5)\ 2^{z+4}-(10)\ 3^{z+4}+(10)\ 4^{z+4}-(5)\ 5^{z+4}+(1)\ 6^{z+4}\right).$$

(I have written this expression in a form that suggests an alternate derivation via the Principle of Inclusion-Exclusion.)
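The inclusion-exclusion route can be checked numerically. Here is a minimal sketch that is not part of the original answer: the function names are made up for illustration, and the index relation $F(z) = \Pr(\text{total rolls} \le z+5)$ is what the loop at the end tests for the expression as written above.

```python
from fractions import Fraction
from math import comb

def cdf_total_rolls(n, sides=6):
    """P(every face has appeared within n rolls), by inclusion-exclusion
    over the number k of faces that are still missing."""
    return sum(
        (-1) ** k * comb(sides, k) * Fraction(sides - k, sides) ** n
        for k in range(sides + 1)
    )

def closed_form_F(z):
    """The closed-form expression for F(z) quoted above, evaluated exactly."""
    signed_binomials = [-1, 5, -10, 10, -5, 1]  # coefficients of 1, 2, ..., 6
    total = sum(c * (j + 1) ** (z + 4) for j, c in enumerate(signed_binomials))
    return Fraction(total, 6 ** (z + 4))

# The closed form agrees with inclusion-exclusion once the index is shifted by 5.
for z in range(1, 40):
    assert closed_form_F(z) == cdf_total_rolls(z + 5)
```

Exact fractions are used so that the comparison is an identity rather than a floating-point approximation.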

From this we obtain the expected number of moves in the game (answering the first question) as

$$\mathbb{E}(6+X) = 6+\sum_{i=1}^\infty \left(1-F(i)\right) = \frac{147}{10}.$$
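As a cross-check on $147/10$, here is a short sketch that takes a different route from the pgf: while $k$ faces are still missing, the wait for the next new face is geometric with mean $6/k$, so the expectation is $\sum_{k=1}^{6} 6/k$. The function name is illustrative only.

```python
from fractions import Fraction

def expected_rolls_to_see_all(sides=6):
    """Sum of the mean waiting times for each new face (sides/k with k faces missing)."""
    return sum(Fraction(sides, k) for k in range(1, sides + 1))

print(expected_rolls_to_see_all())  # 147/10
```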

The CDF of the maximum of $m$ independent versions of $X$ is $F(z)^m$ (and from this we can, in principle, answer any probability questions about the maximum we like, such as what is its variance, what is its 99th percentile, and so on). With $m=4$ we obtain an expectation of

$$ 6+\sum_{i=1}^\infty \left(1-F(i)^4\right) \approx 21.4820363\ldots.$$
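The same tail-sum idea is easy to evaluate numerically. The sketch below is only an illustration under stated assumptions: it works with the total number of rolls $T$ directly (so no shift by $6$ is needed), uses the inclusion-exclusion form of the CDF in floating point, and truncates the infinite sum at a hypothetical cutoff `max_n` beyond which the terms are negligible.

```python
from math import comb

def cdf_total_rolls(n, sides=6):
    """P(T <= n): every face has appeared within n rolls (inclusion-exclusion)."""
    return sum(
        (-1) ** k * comb(sides, k) * ((sides - k) / sides) ** n
        for k in range(sides + 1)
    )

def expected_max_rolls(players, sides=6, max_n=2000):
    """E[max of `players` independent games] = sum over n >= 0 of P(max > n),
    where P(max > n) = 1 - P(T <= n) ** players. Truncated at max_n terms."""
    return sum(1.0 - cdf_total_rolls(n, sides) ** players for n in range(max_n))

print(expected_max_rolls(1))  # single player
print(expected_max_rolls(4))  # slowest of four players
```

The two printed values should agree with the exact results above and with the simulation in the question to within its reported standard errors.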

(The value is a rational fraction which, in reduced form, has a 71-digit denominator.) The standard deviation is $6.77108\ldots.$ The probability mass function of the maximum for four players (already shifted by $6$) looks as follows:

[Figure: probability mass function of the maximum number of rolls for four players.]

As one would expect, it is positively skewed. The mode is at $18$ rolls. It is rare that the last person to finish will take more than $50$ rolls (it is about $0.3\%$).
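The mode and the tail probability can be recomputed from the CDF of the maximum. This is a rough sketch rather than the original author's computation: it assumes the floating-point inclusion-exclusion CDF of the total number of rolls, and the search range up to 200 rolls is an arbitrary choice.

```python
from math import comb

def cdf_total_rolls(n, sides=6):
    """P(every face has appeared within n rolls), by inclusion-exclusion."""
    return sum(
        (-1) ** k * comb(sides, k) * ((sides - k) / sides) ** n
        for k in range(sides + 1)
    )

def max_pmf(n, players=4, sides=6):
    """P(the slowest of `players` games finishes on exactly roll n)."""
    return cdf_total_rolls(n, sides) ** players - cdf_total_rolls(n - 1, sides) ** players

# Most likely finishing roll for the slowest of four players.
mode = max(range(6, 200), key=max_pmf)

# Probability that the slowest of four players needs more than 50 rolls.
tail = 1.0 - cdf_total_rolls(50) ** 4

print(mode, tail)
```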

Related Solutions

Solved – Probability of winning a tournament

The model described in the links is not the diffusion model. The model you are trying to implement is called the Independent Chip Model or ICM. They give different estimates for your expected share of second and lower place prizes. Here are two ways to describe the ICM:

(1) Determine the winner so that each player's chance to win is proportional to his chip count. Then remove that player's chips, and determine the second place player so that each non-winner's chance to place second is proportional to his chip count. Repeat.

(2) Randomly remove the chips one at a time so that each remaining chip has the same chance to be removed next. When your last chip is removed, you are eliminated.

It's not obvious that these are the same model. In fact, it's clear that the first description only depends on the proportion of chips, and doesn't require that the number of chips is an integer. It's not obvious that the second method gives the same chance to place second if the stacks are $(100,200,300)$ as for $(1,2,3)$. (In other models, these situations are different.) However, you can see that these are equivalent (when both are defined) by a third description:

(3) Each player marks all of his chips. Then they are shuffled and ordered. Players are ranked by their highest chips.

Description (1) corresponds to looking at the top of the ordering. Description (2) corresponds to looking at the bottom.
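Description (1) translates almost directly into a recursion. The following is a naive sketch for small fields only, assuming integer stacks; `icm_place_probs` is a name chosen here for illustration, not taken from any existing program.

```python
from fractions import Fraction

def icm_place_probs(stacks):
    """Return probs[i][p] = chance that player i finishes in place p + 1,
    picking each successive place in proportion to the remaining stacks."""
    n = len(stacks)
    probs = [[Fraction(0)] * n for _ in range(n)]

    def recurse(remaining, place, weight):
        total = sum(stacks[i] for i in remaining)
        for i in remaining:
            w = weight * Fraction(stacks[i], total)  # player i takes this place
            probs[i][place] += w
            rest = [j for j in remaining if j != i]
            if rest:
                recurse(rest, place + 1, w)

    recurse(list(range(n)), 0, Fraction(1))
    return probs

# Only the proportions matter: (1, 2, 3) and (100, 200, 300) give the same result.
print(icm_place_probs([1, 2, 3]) == icm_place_probs([100, 200, 300]))
```

This enumerates every finishing order, so the work grows factorially with the number of players.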

You can find a lot of information about the Independent Chip Model on the web because it is used by serious tournament poker players, particularly those who play Sit-N-Go (SNG)/Single Table Tournaments (STTs). See, for example, this Nash Equilibrium Calculator for push/fold decisions in STTs which uses the ICM. There are other models like the diffusion model, but the Independent Chip Model seems good enough and is easier to compute. You can find a section on the ICM in my book, The Math of Hold'em, and I have also made videos on it for poker instructional sites.

One of the other answers asks why bother since it's all luck. Understanding equities assuming that you have no skill advantage (but the stacks are not equal) is how many serious poker players GET an advantage. Getting all-in with a $60\%$ chance to win and negligible dead money is sometimes great, and sometimes terrible. The equities say which. Also, if you run a poker server, you need to be prepared to divide the prize money fairly in case the server crashes during the tournament. A poker server asked me for help with this.

As you have noted, a naive implementation which computes your chance to place $p$ out of $n$ players sums over many terms, $(n-1) \times (n-2) \times ... \times (n-p) = \frac{(n-1)!}{(n-p-1)!}$. This may be too large in practice. One improvement I use in my program ICM Explorer is to memoize the probabilities with each subset of up to $p-1$ opponents removed. If you are computing each probability, this takes $n 2^{n-1}$ steps instead of $(n-1)!$, which determines whether you can calculate the case $n=10$ crisply, and whether you can calculate the case $n=20$ in under a second or not at all.
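To make the memoization idea concrete, here is a minimal sketch of one way it could look. It is not the code of ICM Explorer; the function names and the bitmask encoding of "which opponents have already been placed above you" are assumptions made for this illustration.

```python
from fractions import Fraction
from functools import lru_cache

def hero_place_probs(stacks, hero=0):
    """Place distribution for one player, memoizing over subsets of opponents."""
    n = len(stacks)
    total = sum(stacks)
    opponents = [i for i in range(n) if i != hero]

    def chips(mask):
        return sum(stacks[i] for i in opponents if (mask >> i) & 1)

    @lru_cache(maxsize=None)
    def reach(mask):
        """Probability that exactly the opponents in `mask` fill the top places."""
        if mask == 0:
            return Fraction(1)
        prob = Fraction(0)
        for i in opponents:
            if (mask >> i) & 1:
                prev = mask & ~(1 << i)        # state before opponent i was picked
                in_play = total - chips(prev)  # chips still in play at that point
                prob += reach(prev) * Fraction(stacks[i], in_play)
        return prob

    probs = [Fraction(0)] * n
    for mask in range(1 << n):
        if (mask >> hero) & 1:
            continue  # the hero must not already be placed
        placed = bin(mask).count("1")
        probs[placed] += reach(mask) * Fraction(stacks[hero], total - chips(mask))
    return probs

print(hero_place_probs([1000, 500, 500]))
```

Each subset of opponents is visited once and extended by at most $n$ players, which is where the roughly $n 2^{n-1}$ step count comes from.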

If you have repeated stack sizes among your opponents, you can remember how many of them have been eliminated instead of which subset. This is particularly useful when you are analyzing multitable tournaments where you only see the stacks at your table, or only a few front-runners, and you assume everyone else you don't see has the same stack size. This makes calculating your equity feasible for large tournaments. This method has a complexity roughly equal to $k \prod_{i=1}^k (m_i+1)$ where there are $k$ different stack sizes among your opponents whose multiplicities are $m_1, ... ,m_k$.
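A sketch of the grouped variant follows, again as an illustration rather than the original program: the state is a tuple counting how many opponents of each distinct stack size have already been placed above you, and all names are hypothetical.

```python
from collections import Counter
from fractions import Fraction
from functools import lru_cache
from itertools import product

def hero_place_probs_grouped(hero_stack, opponent_stacks):
    """Place distribution for the hero, tracking counts per distinct stack size."""
    groups = sorted(Counter(opponent_stacks).items())  # [(size, multiplicity), ...]
    sizes = [size for size, _ in groups]
    mults = [mult for _, mult in groups]
    total = hero_stack + sum(opponent_stacks)

    def chips_removed(state):
        return sum(count * size for count, size in zip(state, sizes))

    @lru_cache(maxsize=None)
    def reach(state):
        """Probability that the top places hold state[i] opponents of size sizes[i]."""
        if not any(state):
            return Fraction(1)
        prob = Fraction(0)
        for i, count in enumerate(state):
            if count == 0:
                continue
            prev = state[:i] + (count - 1,) + state[i + 1:]
            in_play = total - chips_removed(prev)
            candidates = mults[i] - (count - 1)  # players of this size still in play
            prob += reach(prev) * Fraction(candidates * sizes[i], in_play)
        return prob

    probs = [Fraction(0)] * (len(opponent_stacks) + 1)
    # Lexicographic order guarantees every predecessor state is cached first.
    for state in product(*(range(m + 1) for m in mults)):
        in_play = total - chips_removed(state)
        probs[sum(state)] += reach(state) * Fraction(hero_stack, in_play)
    return probs

# One double stack against 178 equal single stacks: only 179 states are needed.
probs = hero_place_probs_grouped(2, [1] * 178)
print(float(probs[0]), float(probs[1]))
```

The number of states is $\prod_{i=1}^k (m_i+1)$, each with at most $k$ predecessors, in line with the complexity estimate above.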

There is one implementation I know about which lets the user calculate ICM equities for multitable tournaments. This uses a simulation. The author assured me that it converges rapidly to within a $0.1\%$ chance for each place. In case you need more accuracy, one simple variance reduction method works very well: Estimate your luck from being chosen or not to finish next at each step by the exact calculations with fewer distinct stack sizes. Subtract this estimate of luck (a vector) from the vector of probabilities obtained in the simulation.

For example, suppose your stack is $1000$, and you have $2$ opponents with stacks of $500$ and one with $1500$.

If one of the players with $500$ wins, your remaining opponents will average $1000$, so you estimate your chances using the ICM exactly assuming $2$ opponents with $1000$ chips. Since all stacks would be equal, by symmetry all $3$ players would have an equal chance to finish second, third, and fourth, so your place distribution would be $(0,1/3,1/3,1/3)$.

If the big stack wins, your remaining opponents average $500$, and the ICM says your place distribution is $(0,1/2,1/3,1/6)$.

If you win, obviously your place distribution is $(1,0,0,0)$.

The weighted average is

$$\frac{1000}{3500}(0,1/3,1/3,1/3) + \frac{1500}{3500}(0,1/2,1/3,1/6) + \frac{1000}{3500}(1,0,0,0) = (2/7,13/42,5/21,1/6).$$

So, if in your trial, you win, then you estimate your luck for that step by $(1,0,0,0) - (2/7,13/42,5/21,1/6)$, and subtract this from the result of the trial. If the big stack wins, the trial isn't over, but you estimate your luck for this step by $(0,1/2,1/3,1/6)-(2/7,13/42,5/21,1/6)$, and subtract this and future luck estimates from the outcome of the trial.

The luck estimate averages to $(0,0,0,0)$ and greatly reduces the number of trials needed to achieve a given level of accuracy, particularly for the places closer to first, which are most important for estimating your fair share of the prize money.
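The weighted average and the luck estimates in the example above can be reproduced with exact fractions. This is only a worked check of that arithmetic; the variable names are illustrative.

```python
from fractions import Fraction

total = 3500  # 1000 (you) + 500 + 500 + 1500

# (probability of this outcome for first place, your estimated place distribution)
cases = {
    "a 500 stack wins": (Fraction(500 + 500, total),
                         [0, Fraction(1, 3), Fraction(1, 3), Fraction(1, 3)]),
    "the big stack wins": (Fraction(1500, total),
                           [0, Fraction(1, 2), Fraction(1, 3), Fraction(1, 6)]),
    "you win": (Fraction(1000, total), [1, 0, 0, 0]),
}

# Expected place distribution before the first pick.
baseline = [sum(w * dist[p] for w, dist in cases.values()) for p in range(4)]

# Luck estimate for each outcome: its distribution minus the baseline.
luck = {name: [d - b for d, b in zip(dist, baseline)]
        for name, (w, dist) in cases.items()}

# The luck estimate averages to zero in every place.
mean_luck = [sum(w * luck[name][p] for name, (w, _) in cases.items()) for p in range(4)]

print(baseline)
print(mean_luck)
```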

The distribution of the other stacks matters, but except in extreme situations, you only see a large effect if you are close to the "money," which means that there are at most a few more players than there are prizes. Let's assume the prize structure is the one PokerStars uses for $180$ player tournaments: $0.3, 0.2, 0.119, 0.08, 0.065, 0.05, 0.035, 0.026, 0.017$ for places $1-9$, and a flat $0.012$ for places $10-18$.

Let's consider $2$ pairs of situations. First, you are one of $180$ equal stacks. Your equity is $1/180$ of the prize pool, or $0.5556\%$. Suppose you have doubled up, eliminating one player, and you have $178$ opponents with a stack half as large as yours. According to the ICM, your chance to finish in first place is $1.1111\%$, second $1.1049\%$, ... $18$th $1.0056\%$ for an equity of $1.0917\%$. The quotient $0.5556/1.0917 = 50.887\%$ is how much equity you need to want to get all-in with no dead money with $180$ equal stacks.
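The figures in this first pair of situations can be reproduced in a few lines. The sketch below assumes the special case of one hero against identical opponents, where the ICM recursion collapses to a single loop; the function names are made up for this example.

```python
# Prize structure quoted above: places 1-9, then a flat 0.012 for places 10-18.
PRIZES = [0.3, 0.2, 0.119, 0.08, 0.065, 0.05, 0.035, 0.026, 0.017] + [0.012] * 9

def place_probs_vs_equal_opponents(hero, opp, num_opponents, places):
    """ICM chance of finishing in places 1..places against identical opponents."""
    probs = []
    reach = 1.0  # probability that all higher places went to opponents
    for placed in range(places):
        remaining = num_opponents - placed
        in_play = hero + remaining * opp
        probs.append(reach * hero / in_play)  # hero takes this place
        reach *= remaining * opp / in_play    # an opponent takes it instead
    return probs

def equity(probs, prizes=PRIZES):
    """Expected share of the prize pool."""
    return sum(p * prize for p, prize in zip(probs, prizes))

before = equity(place_probs_vs_equal_opponents(1, 1, 179, 18))
after = equity(place_probs_vs_equal_opponents(2, 1, 178, 18))
print(before, after, before / after)
```

The last printed value is the break-even chance of winning the all-in, which should come out just under $51\%$ as stated.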

Suppose there are $60$ players with a stack equal to yours (including you), $60$ with half of your stack, and $60$ with half again as much as your stack. According to the ICM, your equity is $0.5572\%$ of the prize pool. Next, suppose you double up against an equal stack. Your equity increases to $1.0948\%$ of the prize pool. The equity you need to risk elimination for this is $0.5572/1.0948 = 50.894\%$.

Your expected share of the prize money didn't depend much on the stacks of the other players, and the equity you need to risk your whole stack depended even less on the stacks of the other players. These become sensitive to the stacks of your opponents once you get down to about $25-30$ players left with $18$ prizes.

Solved – What’s the probability that from 25 random numbers between 1 and 100, the highest appears more than once

Let

$x$ be the top end of your range, $x=100$ in your case.
$n$ be the total number of draws, $n=25$ in your case.

For any number $y\le x$, the number of sequences of $n$ numbers with each number in the sequence $\le y$ is $y^n$. Of these sequences, the number containing no $y$s is $(y-1)^n$, and the number containing exactly one $y$ is $n(y-1)^{n-1}$. Hence the number of sequences whose highest number is $y$ and which contain two or more $y$s is

$$y^n - (y-1)^n - n(y-1)^{n-1}$$

Summing over $y$, the total number of sequences of $n$ numbers in which the highest number appears at least twice is

\begin{align} \sum_{y=1}^x \left(y^n - (y-1)^n - n(y-1)^{n-1}\right) &= \sum_{y=1}^x y^n - \sum_{y=1}^x(y-1)^n - \sum_{y=1}^xn(y-1)^{n-1}\\ &= x^n - n\sum_{y=1}^x(y-1)^{n-1}\\ &= x^n - n\sum_{y=1}^{x-1}y^{n-1} \end{align}

where the first two sums telescope to $x^n$.

The total number of sequences is simply $x^n$. All sequences are equally likely and so the probability is

$$ \frac{x^n - n\sum_{y=1}^{x-1}y^{n-1}}{x^n}$$

With $x=100,n=25$ I make the probability 0.120004212454.

I've tested this using the following Python program, which counts the sequences that match manually (for low $x,n$), simulates and calculates using the above formula.

import itertools
import numpy as np

def countinlist(x, n):
    """Brute-force count over all x**n sequences (only feasible for small x and n)."""
    count = 0
    total = 0
    for perm in itertools.product(range(1, x+1), repeat=n):
        total += 1
        if perm.count(max(perm)) > 1:
            count += 1

    print("Counting: x", x, "n", n, "total", total, "count", count)

def simulate(x, n, N):
    """Estimate the probability from N random draws of n numbers in 1..x."""
    count = 0
    for i in range(N):
        perm = np.random.randint(1, x+1, size=n)
        if np.count_nonzero(perm == perm.max()) > 1:
            count += 1
    print("Simulation: x", x, "n", n, "total", N, "count", count, "prob", count/N)

x = 100
n = 25
N = 1000000 # number of trials in simulation

#countinlist(x,n) # only call this for reasonably small x and n!!!!
simulate(x, n, N)
# exact integer arithmetic: x^n - n * sum_{y=1}^{x-1} y^(n-1)
formula = x**n - n*sum(i**(n-1) for i in range(x))
print("Formula count", formula, "out of", x**n, "probability", format(formula / x**n, ".12g"))

This program outputted

Simulation: x 100 n 25 total 1000000 count 120071 prob 0.120071
Formula count 12000421245360277498241319178764675560017783666750 out of 100000000000000000000000000000000000000000000000000 probability 0.120004212454
