[Math] the distribution of the results of a round robin tournament? What is the distribution of the number of winners

graph theoryprobabilitystatistics

With the ATP tour finals tennis tournament going on right now, I got to thinking about round robin tournaments. I learned upon searching that in graph theory the tournament models a round robin, in which each player plays every other player exactly once, and either wins or loses (no draws). But after some searching I was not able to find answers to the questions I posed in the title.

Suppose there are $n$ players in a tournament. The result of a tournament is described as a graph in which each pair of vertices is connected by exactly one directed edge, where an edge going from vertex $i$ to vertex $j$ indicates that player $i$ beat player $j$. Call $g$ the number of edges in the graph, which reflects the number of games played in the tournament. Of course, $g = n(n-1)/2$, the triangular number of $n-1$ (imagine the cells above the diagonal of a tournament crosstable). Clearly, there are $2^g$ possible results of a tournament.

Suppose every player in the tournament is equally skilled, so that every result is equally likely. For a given result, we can count the wins of each player and rank the win counts in a non-decreasing sequence $s = (s_1,\ldots,s_n)$ where $0 \leq s_1 \leq \cdots \leq s_n$. My first question is: what in general would be the probability distribution for $s$? Relatedly, what would be the expectation for each $s_i$?

To get an intuition for this last question, I simulated 10,000 instances of a round robin tournament of 100 equally skilled players. The plot below shows the average of $s$, where the x axis reflects the index $i$ and the y axis the number of wins $s_i$.

This plot

Now let the number of winners of the tournament be denoted $ w = |\{s_i : s_i = \max s\}|$. What is the probability that $w=1$, that is, that the tournament has an outright winner? In general, what is the probability distribution of $w$?

Using the data from the same simulation as above, I plotted the relative frequencies of $w$ values:

frequencies.

Does anyone know of any results that bear on these questions? When I thought of these questions, I figured the answers would be standard results, but after a moderate amount of looking I haven't come across them.

Best Answer

Here is a heuristic argument for the mean of the order statistics $s_1,s_2,\ldots,s_n$.

Let's define a random variable $X_{ij}$ as follows

$$X_{ij}=\begin{cases} 1 & \text{, if player $i$ beats player $j$,}\\ 0 & \text{, if player $i$ loses to player $j$.}\end{cases}$$

Then, if we assume all the players are equally skilled, the $X_{ij}\sim \text{Ber}(0.5)$, i.e. they are Bernoulli random variables with probability of success $0.5$. However, they are not independent. Indeed, we can more precisely state that

$$\begin{eqnarray}X_{ij}\sim\text{Ber}(0.5) & \;\; , & \;\; \text{ for } i<j \\ X_{ij}=1-X_{ji} & \;\; , &\;\; \text{ for } i>j \\ X_{ii}=0 & \;\; , &\;\; \text{ players don't play against themselves } \\ \end{eqnarray} $$

This defines an $n\times n$ matrix whose components are the individual match results. The rowsums corresponds to the total amount of wins, the colomn sums to the total amount of losses.

$$W_i=\sum_{j=1}^n X_{ij} \;\; \text{ and } \;\; L_j=\sum_{i=1}^n X_{ij}$$

Of course, the losses and wins of a player count up to the total amount of matches played

$$W_i+L_i=n-1 \; .$$

We are interested in the order statistics $s_1=W_{(1)},s_2=W_{(2)},\ldots,s_n=W_{(n)}$. $W_{(i)}$ is the i-th element of the ascendingly sorted vector of $(W_1,W_2,\ldots,W_n)$.

Solving this problem is a pretty hard nut to crack. So, I resorted to a heuristic argument that only covers the expectation of the order statistics. What is easy to find is the marginal distribution of a player $i$. These will all be the same and binomial: $W_i\sim \text{Bin}(n-1,0.5)$.

Now, I reasoned that since an individual player could rank anywhere in the hierarchy, the possible outcomes of $W_i$ encode precisely that. Moreover, one could interpret the quantiles of the distribution as follows: if player $i$ ends up to be the player that wins most matches, this means that he is somewhere in the $1/n$th upper quantile of the distribution. If he loses the most matches, he'll be in the $1/n$th lower quantile. Etc, for other quantiles. So, I reasoned that essentially, the function for the means of the $s_i$ is just the quantile distribution of an individual player's wins.

So I did a small simulation in R and found that this indeed gives good results. I simulated 10000 roundrobin tournaments of 100 players. (Don't mind the probably inefficient code.)

n<-100;
mtres<-rep(0,n)

nsim<-10000

for (t in 1:nsim) {

    tourres<-matrix(rep(0,n*n),nrow=n,ncol=n)
    for (k in 2:n) {
        for (l in 1:(k-1)) {
            tourres[k,l]<-sample(c(0,1),1)
            tourres[l,k]<-1-tourres[k,l]
            } 
        }
    mtres<-mtres+sort(rowSums(tourres)) 
}

Then I plotted the mean of the $s_1,s_2,\ldots,s_n$ against the quantiles of the binomial (I added a shift of 0.5 continuity correction) and also the normal approximation of the binomial.

plot(mtres/nsim,type='l')

me<-(n-1)/2
se<-sqrt((n-1)/4)

curve(qnorm(x/n,me,se),from=0,to=n,add=T,col='red')
curve(qbinom((x-0.5)/n,n-1,1/2),from=0,to=n,add=T,col='blue') 

This is the result:

Roundrobin tournament mean order statistics

You almost don't see the difference between the normal approximation and the simulated result. (Although in the tails the difference will be larger)

Here's also the comparison between the simulated results and the binomial quantiles:

> mtres/nsim
  [1] 37.0529 38.8178 39.8103 40.5246 41.1004 41.5688 41.9787 42.3477 42.6726
 [10] 42.9847 43.2639 43.5249 43.7751 44.0082 44.2256 44.4402 44.6479 44.8379
 [19] 45.0221 45.2040 45.3828 45.5611 45.7286 45.8976 46.0504 46.2097 46.3602
 [28] 46.5153 46.6663 46.8137 46.9510 47.0826 47.2206 47.3658 47.5138 47.6548
 [37] 47.7893 47.9121 48.0304 48.1495 48.2822 48.4198 48.5623 48.6912 48.8136
 [46] 48.9286 49.0407 49.1549 49.2819 49.4216 49.5650 49.7027 49.8218 49.9305
 [55] 50.0414 50.1550 50.2857 50.4243 50.5592 50.7022 50.8349 50.9580 51.0779
 [64] 51.1992 51.3318 51.4758 51.6228 51.7688 51.9080 52.0362 52.1782 52.3212
 [73] 52.4726 52.6264 52.7842 52.9417 53.0956 53.2592 53.4346 53.6118 53.7920
 [82] 53.9683 54.1521 54.3493 54.5514 54.7600 54.9891 55.2233 55.4744 55.7365
 [91] 56.0231 56.3181 56.6480 57.0165 57.4239 57.9007 58.4657 59.1880 60.1857
[100] 61.9515
> qbinom(((1:n)-0.5)/n,n-1,1/2)
  [1] 37 39 40 41 41 42 42 42 43 43 43 44 44 44 44 44 45 45 45 45 45 46 46 46 46
 [26] 46 46 47 47 47 47 47 47 47 48 48 48 48 48 48 48 48 49 49 49 49 49 49 49 49
 [51] 50 50 50 50 50 50 50 50 51 51 51 51 51 51 51 51 52 52 52 52 52 52 52 53 53
 [76] 53 53 53 53 54 54 54 54 54 55 55 55 55 55 56 56 56 57 57 57 58 58 59 60 62
> 

If you round the simulation results to the nearest integer, the difference is zero.

A more able statistician than me can probably make this rigorous and add error margins to this result. But I thought it was already neat enough to post this answer.

EDIT : I wanted to add some additional findings. For small values of $n$, the above approximations will be abysmally bad. But, in those cases, it is not to hard to compute by hand the distributions of $(s_1,s_2,\ldots,s_n)$.

Take the case $n=3$ for instance. There's really only two situations possible, being $(1,1,1)$ and $(0,1,2)$. By looking at the matrices I defined before, it's not difficult to see that you can realize the first situation in $1/4$ cases and the second in $3/4$ cases. From there you can compute means, variance, all you want.

Take the case $n=4$. Then you have 4 different possibilities: $(0,1,2,3)$ with probability $4!/2^6$, $(1,1,1,3)$ with probability $8/2^6$, $(0,2,2,2)$ with probability $8/2^6$ and finally $(1,1,2,2)$ with probability $4!/2^6$. You can check that all the realization are really partitions of the number $6=4(4-1)/2$ with the restriction that no element of the partition is larger than 3, plus additional restrictions. The mean values are easily computed to be $(0.5,1.125,1.875,2.5)$. And indeed the simulation agrees with this result.

I wouldn't be surprised if the problem can be recast into the theory of random partitions.

EDIT2 I wrote a little program that computes the exact probability distribution. But it's not very efficient memory or time wise. I had done the computation by hand for a 5 player tournament, and then I tried for a 6 player but decided it would be wiser to program the rules. The result is this

n<-6
m<-n*(n-1)/2
store<-c()

for (k in 0:(2^m-1)) {
    x<-as.numeric(intToBits(k))[1:m]
    y<-1-x
    vec1<-rep(0,n)
    for (ell in (n-1):1) {
        vec1[n-ell]<-sum(x[1:ell])
        x<-x[(ell+1):length(x)]
    }
    vec2<-rep(0,n)
    for (ell in (n-1):1) {
        temp<-seq(1,1+ell*(n-ell-1),ell)
        vec2[n-ell+1]<-sum(y[temp])
        y<-y[-temp]
    }
    store<-c(store,toString(sort(vec1+vec2)))   
}

table(store)
table(store)/2^m
barplot(table(store)/2^m)

I tried to run it for $n=7$ (which is about 64 times more computations and more storage space needed than for $n=6$) but I did not have the patience to wait for the end result. So, Monte Carlo simulation is more efficient to determine round robin tournament results. Or my heuristic if you look at tournaments with very high numbers of participants.