Suppose, calculating the probability of having at least two peoples same birthday from 25 people. What is the sample space and outcome of the experiment?
As far as I pondered, S = 365^25 and outcome 365C25
Solved – Sample space and outcome of birthday problem
birthday paradoxmathematical-statisticsprobabilistic-programmingprobabilitysampling
Related Solutions
This is a counting problem: there are $b^n$ possible assignments of $b$ birthdays to $n$ people. Of those, let $q(k; n, b)$ be the number of assignments for which no birthday is shared by more than $k$ people but at least one birthday actually is shared by $k$ people. The probability we seek can be found by summing the $q(k;n,b)$ for appropriate values of $k$ and multiplying the result by $b^{-n}$.
These counts can be found exactly for values of $n$ less than several hundred. However, they will not follow any straightforward formula: we have to consider the patterns of ways in which birthdays can be assigned. I will illustrate this in lieu of providing a general demonstration. Let $n = 4$ (this is the smallest interesting situation). The possibilities are:
- Each person has a unique birthday; the code is {4}.
- Exactly two people share a birthday; the code is {2,1}.
- Two people have one birthday and the other two have another; the code is {0,2}.
- Three people share a birthday; the code is {1,0,1}.
- Four people share a birthday; the code is {0,0,0,1}.
Generally, the code $\{a[1], a[2], \ldots\}$ is a tuple of counts whose $k^\text{th}$ element stipulates how many distinct birthdates are shared by exactly $k$ people. Thus, in particular,
$$1 a[1] + 2a[2] + ... + k a[k] + \ldots = n.$$
Note, even in this simple case, that there are two ways in which the maximum of two people per birthday is attained: one with the code $\{0,2\}$ and another with the code $\{2,1\}$.
We can directly count the number of possible birthday assignments corresponding to any given code. This number is the product of three terms. One is a multinomial coefficient; it counts the number of ways of partitioning $n$ people into $a[1]$ groups of $1$, $a[2]$ groups of $2$, and so on. Because the sequence of groups does not matter, we have to divide this multinomial coefficient by $a[1]!a[2]!\cdots$; its reciprocal is the second term. Finally, line up the groups and assign them each a birthday: there are $b$ candidates for the first group, $b-1$ for the second, and so on. These values have to be multiplied together, forming the third term. It is equal to the "factorial product" $b^{(a[1]+a[2]+\cdots)}$ where $b^{(m)}$ means $b(b-1)\cdots(b-m+1)$.
There is an obvious and fairly simple recursion relating the count for a pattern $\{a[1], \ldots, a[k]\}$ to the count for the pattern $\{a[1], \ldots, a[k-1]\}$. This enables rapid calculation of the counts for modest values of $n$. Specifically, $a[k]$ represents $a[k]$ birthdates shared by exactly $k$ people each. After these $a[k]$ groups of $k$ people have been drawn from the $n$ people, which can be done in $x$ distinct ways (say), it remains to count the number of ways of achieving the pattern $\{a[1], \ldots, a[k-1]\}$ among the remaining people. Multiplying this by $x$ gives the recursion.
I doubt there is a closed form formula for $q(k; n, b)$, which is obtained by summing the counts for all partitions of $n$ whose maximum term equals $k$. Let me offer some examples:
With $b=5$ (five possible birthdays) and $n=4$ (four people), we obtain
$$\eqalign{ q(1) &= q(1;4,5) &= 120 \\ q(2) &= 360 + 60 &= 420 \\ q(3) &&= 80 \\ q(4) &&= 5.\\ }$$
Whence, for example, the chance that three or more people out of four share the same "birthday" (out of $5$ possible dates) equals $(80 + 5)/625 = 0.136$.
As another example, take $b = 365$ and $n = 23$. Here are the values of $q( k;23,365)$ for the smallest $k$ (to six sig figs only):
$$\eqalign{ k=1: &0.49270 \\ k=2: &0.494592 \\ k=3: &0.0125308 \\ k=4: &0.000172844 \\ k=5: &1.80449E-6 \\ k=6: &1.48722E-8 \\ k=7: &9.92255E-11 \\ k=8: &5.45195E-13. }$$
Using this technique, we can readily compute that there is about a 50% chance of (at least) a three-way birthday collision among 87 people, a 50% chance of a four-way collision among 187, and a 50% chance of a five-way collision among 310 people. That last calculation starts taking a few seconds (in Mathematica, anyway) because the number of partitions to consider starts getting large. For substantially larger $n$ we need an approximation.
One approximation is obtained by means of the Poisson distribution with expectation $n/b$, because we can view a birthday assignment as arising from $b$ almost (but not quite) independent Poisson variables each with expectation $n/b$: the variable for any given possible birthday describes how many of the $n$ people have that birthday. The distribution of the maximum is therefore approximately $F(k)^b$ where $F$ is the Poisson CDF. This is not a rigorous argument, so let's do a little testing. The approximation for $n = 23$, $b = 365$ gives
$$\eqalign{ k=1: &0.498783 \\ k=2: &0.496803\\ k=3: &0.014187\\ k=4: &0.000225115. }$$
By comparing with the preceding you can see that the relative probabilities can be poor when they are small, but the absolute probabilities are reasonably well approximated to about 0.5%. Testing with a wide range of $n$ and $b$ suggests the approximation is usually about this good.
To wrap up, let's consider the original question: take $n = 10,000$ (the number of observations) and $b = 1\,000\,000$ (the number of possible "structures," approximately). The approximate distribution for the maximum number of "shared birthdays" is
$$\eqalign{ k=1: &0 \\ k=2: &0.8475+\\ k=3: &0.1520+\\ k=4: &0.0004+\\ k\gt 4: &\lt 1E-6. }$$
(This is a fast calculation.) Clearly, observing one structure 10 times out of 10,000 would be highly significant. Because $n$ and $b$ are both large, I expect the approximation to work quite well here.
Incidentally, as Shane intimated, simulations can provide useful checks. A Mathematica simulation is created with a function like
simulate[n_, b_] := Max[Last[Transpose[Tally[RandomInteger[{0, b - 1}, n]]]]];
which is then iterated and summarized, as in this example which runs 10,000 iterations of the $n = 10000$, $b = 1\,000\,000$ case:
Tally[Table[simulate[10000, 1000000], {n, 1, 10000}]] // TableForm
Its output is
2 8503
3 1493
4 4
These frequencies closely agree with those predicted by the Poisson approximation.
The important issue with the birthday problem is that each person's BIRTHDAY is independent. Your point that the chance of collisions increases as the number of people in the room increases is exactly the point of the calculation.
An example of a case where birthdays are not independent would be, testing the chance of collision at a twin convention.
The way I think about the solution is to take the inverse of the case where everybody has different birthdays. When two people are in the room, the odds of no collision are $\frac{365}{365} * \frac{364}{365}$, so the odds of a collision are $1-\frac{365}{365} * \frac{364}{365}=\frac{1}{365}$. When a third person enters the room, the odds of no collision are $\frac{365}{365} * \frac{364}{365}* \frac{363}{365}$, so the odds of a collision are $1-\frac{365}{365}*\frac{364}{365}*\frac{363}{365}=\frac{1093}{133225}=1-\frac{\frac{365!}{(365-n)!}}{365^n}$
Best Answer
The sample space is the set of all possible outcomes of the experiment, corresponding to the Cartesian product of the set of $365$ possible birth dates (after hedging for pertinent caveats as to the possibility of leap years, seasonality in births, etc) with itself as many times as the number of individuals in the room.
This will produce $365^n$ ordered pairs, or $365^{25}$ $25$-tuples. From Wikipedia:
From this sample space, the event of getting two people with the same birthday can be assigned a probability. Being that we are dealing with a discrete probability space, outcomes and events can be treated as interchangeable.
The actual calculation is not $\binom {365}{25}$. The denominator will be the number of elements in the sample space, $365^{25}$, and the numerator will be much more straightforward to figure out if instead we calculate the complementary: the probability of having no matches, which can be calculated easily thinking that each person "occupies" a date that no other person can share:
$$\Pr(\text{at least 2 people share birthday})=1- \frac{365\times364\times\cdots\times \overbrace{341}^{365-n+1}}{365^{25}}$$