I might be overthinking this. I generated the output in R and 5 of my 10 samples were successful, so that's 50%. Given that, if I am to estimate the probability of two or more people in a group of 30 sharing a birthday, what is my total sample? Should I be using combinations?
Solved – Birthday paradox: How to estimate the probability of two or more people in a group of 30 sharing a birthday
birthday paradoxprobabilityr
Related Solutions
This is a counting problem: there are $b^n$ possible assignments of $b$ birthdays to $n$ people. Of those, let $q(k; n, b)$ be the number of assignments for which no birthday is shared by more than $k$ people but at least one birthday actually is shared by $k$ people. The probability we seek can be found by summing the $q(k;n,b)$ for appropriate values of $k$ and multiplying the result by $b^{-n}$.
These counts can be found exactly for values of $n$ less than several hundred. However, they will not follow any straightforward formula: we have to consider the patterns of ways in which birthdays can be assigned. I will illustrate this in lieu of providing a general demonstration. Let $n = 4$ (this is the smallest interesting situation). The possibilities are:
- Each person has a unique birthday; the code is {4}.
- Exactly two people share a birthday; the code is {2,1}.
- Two people have one birthday and the other two have another; the code is {0,2}.
- Three people share a birthday; the code is {1,0,1}.
- Four people share a birthday; the code is {0,0,0,1}.
Generally, the code $\{a[1], a[2], \ldots\}$ is a tuple of counts whose $k^\text{th}$ element stipulates how many distinct birthdates are shared by exactly $k$ people. Thus, in particular,
$$1 a[1] + 2a[2] + ... + k a[k] + \ldots = n.$$
Note, even in this simple case, that there are two ways in which the maximum of two people per birthday is attained: one with the code $\{0,2\}$ and another with the code $\{2,1\}$.
We can directly count the number of possible birthday assignments corresponding to any given code. This number is the product of three terms. One is a multinomial coefficient; it counts the number of ways of partitioning $n$ people into $a[1]$ groups of $1$, $a[2]$ groups of $2$, and so on. Because the sequence of groups does not matter, we have to divide this multinomial coefficient by $a[1]!a[2]!\cdots$; its reciprocal is the second term. Finally, line up the groups and assign them each a birthday: there are $b$ candidates for the first group, $b-1$ for the second, and so on. These values have to be multiplied together, forming the third term. It is equal to the "factorial product" $b^{(a[1]+a[2]+\cdots)}$ where $b^{(m)}$ means $b(b-1)\cdots(b-m+1)$.
There is an obvious and fairly simple recursion relating the count for a pattern $\{a[1], \ldots, a[k]\}$ to the count for the pattern $\{a[1], \ldots, a[k-1]\}$. This enables rapid calculation of the counts for modest values of $n$. Specifically, $a[k]$ represents $a[k]$ birthdates shared by exactly $k$ people each. After these $a[k]$ groups of $k$ people have been drawn from the $n$ people, which can be done in $x$ distinct ways (say), it remains to count the number of ways of achieving the pattern $\{a[1], \ldots, a[k-1]\}$ among the remaining people. Multiplying this by $x$ gives the recursion.
I doubt there is a closed form formula for $q(k; n, b)$, which is obtained by summing the counts for all partitions of $n$ whose maximum term equals $k$. Let me offer some examples:
With $b=5$ (five possible birthdays) and $n=4$ (four people), we obtain
$$\eqalign{ q(1) &= q(1;4,5) &= 120 \\ q(2) &= 360 + 60 &= 420 \\ q(3) &&= 80 \\ q(4) &&= 5.\\ }$$
Whence, for example, the chance that three or more people out of four share the same "birthday" (out of $5$ possible dates) equals $(80 + 5)/625 = 0.136$.
As another example, take $b = 365$ and $n = 23$. Here are the values of $q( k;23,365)$ for the smallest $k$ (to six sig figs only):
$$\eqalign{ k=1: &0.49270 \\ k=2: &0.494592 \\ k=3: &0.0125308 \\ k=4: &0.000172844 \\ k=5: &1.80449E-6 \\ k=6: &1.48722E-8 \\ k=7: &9.92255E-11 \\ k=8: &5.45195E-13. }$$
Using this technique, we can readily compute that there is about a 50% chance of (at least) a three-way birthday collision among 87 people, a 50% chance of a four-way collision among 187, and a 50% chance of a five-way collision among 310 people. That last calculation starts taking a few seconds (in Mathematica, anyway) because the number of partitions to consider starts getting large. For substantially larger $n$ we need an approximation.
One approximation is obtained by means of the Poisson distribution with expectation $n/b$, because we can view a birthday assignment as arising from $b$ almost (but not quite) independent Poisson variables each with expectation $n/b$: the variable for any given possible birthday describes how many of the $n$ people have that birthday. The distribution of the maximum is therefore approximately $F(k)^b$ where $F$ is the Poisson CDF. This is not a rigorous argument, so let's do a little testing. The approximation for $n = 23$, $b = 365$ gives
$$\eqalign{ k=1: &0.498783 \\ k=2: &0.496803\\ k=3: &0.014187\\ k=4: &0.000225115. }$$
By comparing with the preceding you can see that the relative probabilities can be poor when they are small, but the absolute probabilities are reasonably well approximated to about 0.5%. Testing with a wide range of $n$ and $b$ suggests the approximation is usually about this good.
To wrap up, let's consider the original question: take $n = 10,000$ (the number of observations) and $b = 1\,000\,000$ (the number of possible "structures," approximately). The approximate distribution for the maximum number of "shared birthdays" is
$$\eqalign{ k=1: &0 \\ k=2: &0.8475+\\ k=3: &0.1520+\\ k=4: &0.0004+\\ k\gt 4: &\lt 1E-6. }$$
(This is a fast calculation.) Clearly, observing one structure 10 times out of 10,000 would be highly significant. Because $n$ and $b$ are both large, I expect the approximation to work quite well here.
Incidentally, as Shane intimated, simulations can provide useful checks. A Mathematica simulation is created with a function like
simulate[n_, b_] := Max[Last[Transpose[Tally[RandomInteger[{0, b - 1}, n]]]]];
which is then iterated and summarized, as in this example which runs 10,000 iterations of the $n = 10000$, $b = 1\,000\,000$ case:
Tally[Table[simulate[10000, 1000000], {n, 1, 10000}]] // TableForm
Its output is
2 8503
3 1493
4 4
These frequencies closely agree with those predicted by the Poisson approximation.
Luckily someone has posted some genuine birthday data with a bit of discussion of a related question (is the distribution uniform). We can use this and resampling to show that the answer to your question is apparently 23 - the same as the theoretical answer.
> x <- read.table("bdata.txt", header=T)
> birthday <- data.frame(date=as.factor(x$date), count=x$count)
> summary(birthday)
date count
101 : 1 Min. : 325
102 : 1 1st Qu.:1266
103 : 1 Median :1310
104 : 1 Mean :1314
105 : 1 3rd Qu.:1362
106 : 1 Max. :1559
(Other):360
> results <- rep(0,50)
> reps <-2000 # big number needed as there is some instability otherwise
> for (i in 1:50)
+ {
+ count <- 0
+ for (j in 1:reps)
+ {
+ samp <- sample(birthday$date, i, replace=T, prob=birthday$count)
+ count <- count + 1*(max(table(samp))>1)
+ }
+ results[i] <- count/reps
+ }
> results
[1] 0.0000 0.0045 0.0095 0.0220 0.0210 0.0395 0.0570 0.0835 0.0890 0.1165
[11] 0.1480 0.1770 0.1955 0.2265 0.2490 0.2735 0.3105 0.3350 0.3910 0.4165
[21] 0.4690 0.4560 0.5210 0.5310 0.5745 0.5975 0.6240 0.6430 0.6950 0.7015
[31] 0.7285 0.7510 0.7690 0.8025 0.8225 0.8280 0.8525 0.8645 0.8685 0.8830
[41] 0.8965 0.9020 0.9240 0.9435 0.9350 0.9465 0.9545 0.9655 0.9600 0.9665
Best Answer
How are you generating your birthdays? To generate 23 birthdays:
To see if 2 or more share the same birthday:
How often is the above TRUE?
This closely matches the theoretical value of 50.7% in the wikipedia page