Solved – Extending the birthday paradox to more than 2 people

birthday paradoxcombinatoricsprobability

In the traditional Birthday Paradox the question is "what are the chances that two or more people in a group of $n$ people share a birthday". I'm stuck on a problem which is an extension of this.

Instead of knowing the probability that two people share a birthday, I need to extend the question to know what is the probability that $x$ or more people share a birthday. With $x=2$ you can do this by calculating the probability that no two people share a birthday and subtract that from $1$, but I don't think I can extend this logic to larger numbers of $x$.

To further complicate this I also need a solution which will work for very large numbers for $n$ (millions) and $x$ (thousands).

Best Answer

This is a counting problem: there are $b^n$ possible assignments of $b$ birthdays to $n$ people. Of those, let $q(k; n, b)$ be the number of assignments for which no birthday is shared by more than $k$ people but at least one birthday actually is shared by $k$ people. The probability we seek can be found by summing the $q(k;n,b)$ for appropriate values of $k$ and multiplying the result by $b^{-n}$.

These counts can be found exactly for values of $n$ less than several hundred. However, they will not follow any straightforward formula: we have to consider the patterns of ways in which birthdays can be assigned. I will illustrate this in lieu of providing a general demonstration. Let $n = 4$ (this is the smallest interesting situation). The possibilities are:

Each person has a unique birthday; the code is {4}.
Exactly two people share a birthday; the code is {2,1}.
Two people have one birthday and the other two have another; the code is {0,2}.
Three people share a birthday; the code is {1,0,1}.
Four people share a birthday; the code is {0,0,0,1}.

Generally, the code $\{a[1], a[2], \ldots\}$ is a tuple of counts whose $k^\text{th}$ element stipulates how many distinct birthdates are shared by exactly $k$ people. Thus, in particular,

$$1 a[1] + 2a[2] + ... + k a[k] + \ldots = n.$$

Note, even in this simple case, that there are two ways in which the maximum of two people per birthday is attained: one with the code $\{0,2\}$ and another with the code $\{2,1\}$.

We can directly count the number of possible birthday assignments corresponding to any given code. This number is the product of three terms. One is a multinomial coefficient; it counts the number of ways of partitioning $n$ people into $a[1]$ groups of $1$, $a[2]$ groups of $2$, and so on. Because the sequence of groups does not matter, we have to divide this multinomial coefficient by $a[1]!a[2]!\cdots$; its reciprocal is the second term. Finally, line up the groups and assign them each a birthday: there are $b$ candidates for the first group, $b-1$ for the second, and so on. These values have to be multiplied together, forming the third term. It is equal to the "factorial product" $b^{(a[1]+a[2]+\cdots)}$ where $b^{(m)}$ means $b(b-1)\cdots(b-m+1)$.

There is an obvious and fairly simple recursion relating the count for a pattern $\{a[1], \ldots, a[k]\}$ to the count for the pattern $\{a[1], \ldots, a[k-1]\}$. This enables rapid calculation of the counts for modest values of $n$. Specifically, $a[k]$ represents $a[k]$ birthdates shared by exactly $k$ people each. After these $a[k]$ groups of $k$ people have been drawn from the $n$ people, which can be done in $x$ distinct ways (say), it remains to count the number of ways of achieving the pattern $\{a[1], \ldots, a[k-1]\}$ among the remaining people. Multiplying this by $x$ gives the recursion.

I doubt there is a closed form formula for $q(k; n, b)$, which is obtained by summing the counts for all partitions of $n$ whose maximum term equals $k$. Let me offer some examples:

With $b=5$ (five possible birthdays) and $n=4$ (four people), we obtain

$$\eqalign{ q(1) &= q(1;4,5) &= 120 \\ q(2) &= 360 + 60 &= 420 \\ q(3) &&= 80 \\ q(4) &&= 5.\\ }$$

Whence, for example, the chance that three or more people out of four share the same "birthday" (out of $5$ possible dates) equals $(80 + 5)/625 = 0.136$.

As another example, take $b = 365$ and $n = 23$. Here are the values of $q( k;23,365)$ for the smallest $k$ (to six sig figs only):

$$\eqalign{ k=1: &0.49270 \\ k=2: &0.494592 \\ k=3: &0.0125308 \\ k=4: &0.000172844 \\ k=5: &1.80449E-6 \\ k=6: &1.48722E-8 \\ k=7: &9.92255E-11 \\ k=8: &5.45195E-13. }$$

Using this technique, we can readily compute that there is about a 50% chance of (at least) a three-way birthday collision among 87 people, a 50% chance of a four-way collision among 187, and a 50% chance of a five-way collision among 310 people. That last calculation starts taking a few seconds (in Mathematica, anyway) because the number of partitions to consider starts getting large. For substantially larger $n$ we need an approximation.

One approximation is obtained by means of the Poisson distribution with expectation $n/b$, because we can view a birthday assignment as arising from $b$ almost (but not quite) independent Poisson variables each with expectation $n/b$: the variable for any given possible birthday describes how many of the $n$ people have that birthday. The distribution of the maximum is therefore approximately $F(k)^b$ where $F$ is the Poisson CDF. This is not a rigorous argument, so let's do a little testing. The approximation for $n = 23$, $b = 365$ gives

$$\eqalign{ k=1: &0.498783 \\ k=2: &0.496803\\ k=3: &0.014187\\ k=4: &0.000225115. }$$

By comparing with the preceding you can see that the relative probabilities can be poor when they are small, but the absolute probabilities are reasonably well approximated to about 0.5%. Testing with a wide range of $n$ and $b$ suggests the approximation is usually about this good.

To wrap up, let's consider the original question: take $n = 10,000$ (the number of observations) and $b = 1\,000\,000$ (the number of possible "structures," approximately). The approximate distribution for the maximum number of "shared birthdays" is

$$\eqalign{ k=1: &0 \\ k=2: &0.8475+\\ k=3: &0.1520+\\ k=4: &0.0004+\\ k\gt 4: &\lt 1E-6. }$$

(This is a fast calculation.) Clearly, observing one structure 10 times out of 10,000 would be highly significant. Because $n$ and $b$ are both large, I expect the approximation to work quite well here.

Incidentally, as Shane intimated, simulations can provide useful checks. A Mathematica simulation is created with a function like

simulate[n_, b_] := Max[Last[Transpose[Tally[RandomInteger[{0, b - 1}, n]]]]];

which is then iterated and summarized, as in this example which runs 10,000 iterations of the $n = 10000$, $b = 1\,000\,000$ case:

Tally[Table[simulate[10000, 1000000], {n, 1, 10000}]] // TableForm

Its output is

2 8503

3 1493

4 4

These frequencies closely agree with those predicted by the Poisson approximation.

Related Solutions

Probability – What is the Real Answer to the Birthday Question?

Luckily someone has posted some genuine birthday data with a bit of discussion of a related question (is the distribution uniform). We can use this and resampling to show that the answer to your question is apparently 23 - the same as the theoretical answer.

> x <- read.table("bdata.txt", header=T)
> birthday <- data.frame(date=as.factor(x$date), count=x$count)
> summary(birthday) 
      date         count     
 101    :  1   Min.   : 325  
 102    :  1   1st Qu.:1266  
 103    :  1   Median :1310  
 104    :  1   Mean   :1314  
 105    :  1   3rd Qu.:1362  
 106    :  1   Max.   :1559  
 (Other):360                 
> results <- rep(0,50)
> reps <-2000 # big number needed as there is some instability otherwise
> for (i in 1:50)
+ {
+ count <- 0
+ for (j in 1:reps)
+ {
+ samp <- sample(birthday$date, i, replace=T, prob=birthday$count)
+ count <- count + 1*(max(table(samp))>1)
+ }
+ results[i] <- count/reps
+ }
> results
 [1] 0.0000 0.0045 0.0095 0.0220 0.0210 0.0395 0.0570 0.0835 0.0890 0.1165
[11] 0.1480 0.1770 0.1955 0.2265 0.2490 0.2735 0.3105 0.3350 0.3910 0.4165
[21] 0.4690 0.4560 0.5210 0.5310 0.5745 0.5975 0.6240 0.6430 0.6950 0.7015
[31] 0.7285 0.7510 0.7690 0.8025 0.8225 0.8280 0.8525 0.8645 0.8685 0.8830
[41] 0.8965 0.9020 0.9240 0.9435 0.9350 0.9465 0.9545 0.9655 0.9600 0.9665

Probability – Birthday Paradox with a Huge Twist: Probability of Sharing Exact Same Date of Birth with Partner

For any one relationship, the odds of sharing the same month and day are approximately 1 in 365 (not exactly because of leap year and because births are not exactly evenly spaced within a year. If you add in year, it's probably something like 1 in 3000 or 4000 (most people have relationships with people relatively close in age).

But that' a priori.

That is, if you had asked, before meeting your current boyfriend "What are the odds that the next man I have a relationship with will be born on same day and year?" the odds would have been 1 in 3000 or so.

However, post hoc (that is, while in the relationship) it's trickier because you would have noticed a lot of other coincidences too: My boyfriend was born the day before me! My boyfriend's mother has the same name as my mother!" etc etc.

The odds of "some weird connection with my boyfriend" are impossible to calculate.

Best Answer

Related Solutions

Probability – What is the Real Answer to the Birthday Question?

Probability – Birthday Paradox with a Huge Twist: Probability of Sharing Exact Same Date of Birth with Partner

Related Question