Probability distributions for sorted observations of a random variable, such as the nth-highest roll from k dice.

diceprobability distributionsrandom variablesstatistics

This seems to be a bit of an unorthodox question since I haven't been able to find any questions quite like it on this site. The closest I can find are those pertaining specifically to the scenarios of rolling dice and ignoring some number of either only the lowest or only the highest. So, I'll first describe a more general case of the problem for anyone else who might need to know how the distribution of ordered observations of a random variable works.

Note on notation: As per Wikipedia, I'm using $\left[ a .. b \right]$ to mean the set of integers between $a$ and $b$—that is, $\left[ a .. b \right] = \left[ a,b \right] \cap \mathbb{N}$


General Case

Take $k$ observations, $a_1$ through $a_k$, of some random variable $A$ with distribution $\mathcal{A}$. Put them in a list $b$, ordered from smallest to largest, so that $b_n$ is the $n$th-smallest value of $\{ a_i \mid i \in [1 .. k]\}$. What is the probability distribution of $b_n$? I will refer to this distribution as $\Omega(\mathcal{A}, k, n)$.

Example

Let's say $k$ = 5 and $A$ is a standard uniform random variable. That is, $A \sim \mathcal{U}(0,1)$.

We obtain five observations of $\mathcal{U}(0,1)$:

  • $a = ( .376, .531, .826, .896, .166 )$

and then we sort them:

  • $b = ( .166, .376, .531, .826, .896 )$

Each $b_n$ is an observation of $\Omega(\mathcal{U}(0,1), 5, n)$. For example, 0.531 is an observation of $\Omega(\mathcal{U}(0,1), 5, 3)$.


Specific Case

The particular reason I came across this problem is for tabletop role-playing games, which typically use dice for determining random outcomes. A common method for modifying the shape of probability distributions without changing the minimum and maximum possible values is to have players perform one or more extra rolls and then discard the highest or lowest results before summing the remaining values. For example, roll a 20-sided dice twice and discard the highest, or roll a 6-sided dice four times and discard the lowest. When designing new game mechanics that use this type of random selection, it would be good to know what the actual probability distributions are. To determine the probability distribution of a sum of certain rolls in a sorted set, you need to know the probability distribution of each die after sorting.

So, my specific case is a subset of the general case for a discrete uniform distribution with domain $[1..s]$, where s is the number of sides on the die. I will call this distribution $D(s)$. It has the following probability mass function and cumulative distribution function:

$p(x; s) = \left\{\begin{array}{ll}\frac{1}{s} & \quad x \in [1 .. s] \\ 0 & \quad \text{otherwise} \end{array}\right.$

$F(x; s) = \left\{\begin{array}{ll} 0 & \quad x \lt 1 \\ \frac{\lfloor x \rfloor}{s} & \quad 1 \leq x \leq s \\ 1 & \quad x \gt s \end{array}\right.$

What I'm looking for is a way to get the probability distribution for the $n$th-smallest value from rolling an $s$-sided die $k$ times. Formally, I want to find the probability mass function $p(x; s, k, n)$ and/or the cumulative distribution function $F(x; s, k, n)$ of $\Omega(D(s), k, n)$.

Example

Let's say $k$ = 4 and $A$ is a random variable that represents the outcome of rolling a 20-sided die. That is, $A \sim D(20)$.

We obtain four observations of $D(20)$:

  • $a = ( 19, 10, 10, 13)$

and then we sort them:

  • $b = ( 10, 10, 13, 19 )$

Each $b_n$ is an observation of $\Omega(D(20), 4, n)$. For example, the second 10 is an observation of $\Omega(D(20), 4, 2)$.


What I am personally looking for is just a solution to the dice problem. However, out of curiosity as to whether it is even possible, I do wonder if it is possible to come up with a more general composite function that handles the general case for any random variable. I'm sure there are use cases for needing to know the distribution of things like the middle half of observations from a normal distribution, or the tenth-highest outcome out of a hundred from a geometric distribution.

Best Answer

You want the probability that the $n$th-smallest value from rolling an $s$-sided die $k$ times is $x$.

This is the probability that at least $n$ of the $k$ rolls give $x$ or less minus the probability that at least $n$ of the $k$ rolls give $x-1$ or less

which is the same as the probability that no more than $n-1$ of the $k$ rolls give $x-1$ or less minus the probability that no more than $n-1$ of the $k$ rolls give $x$ or less

which you can write as $$\sum\limits_{j=0}^{n-1} {k \choose j}\frac{(x-1)^j(s-x+1)^{k-j}-x^j(s-x)^{k-j}}{s^n}$$

but if you have a binomial CDF function to hand, such as in R, it might be easier to write:

pbinom(n-1, k, (x-1)/s) - pbinom(n-1, k, x/s) 

making the probability that the second smallest value of four $D(20)$ rolls is $10$ equal to $0.07848125$ (the value of $7$ is more likely, with a probability of $0.08871875$, the median is $8$ and the expected value is $8.5000125$)


The CDF is $1$ minus the probability that no more than $n-1$ of the $k$ rolls give $x$ or less so

$$1- \sum\limits_{j=0}^{n-1} {k \choose j}\frac{x^j(s-x)^{k-j}}{s^n}$$

or in R

 1 - pbinom(n-1, k, x/s) 
Related Question