Solved – How to determine xdy dice roll from a sample

dice

I'm playing an RPG based on DnD-style xdy dice damage rolls, where x is the number of dice rolled and y is the number of sides on each die. So, the familiar pair of six-sided dice would be 2d6. These dice rolls generate a discrete distribution that approaches a normal distribution as more dice are added.

I would like to know if there exists a statistical method which, when given a sample of damage output (where xdy is unknown), will allow me to match it to some xdy roll.

So, given say 1000 hits, I find I have a range of 3-18, a mean of 10.71, and a standard deviation of ~3.01. This looks pretty close to a 3d6, but is there some statistical test I can run to generate some probability estimate, confidence interval/level, etc?

I realize that certain heuristics can help to some extent: if I have a "3" in my sample, then the "x" in xdy must be less than or equal to 3. Similarly, the mean and standard deviation often give good hints. However, the difference between, say, a 5d6 and a 6d5 is fairly small, and even a very large sample might still have "6" as the lowest observed value for a 5d6 roll (the true minimum, 5, occurs with probability $1/6^5\approx 0.00013$ per roll).
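(The moment heuristic can be made exact: a fair y-sided die has mean $(y+1)/2$ and variance $(y^2-1)/12$, and independent dice add, so xdy has mean $x(y+1)/2$ and standard deviation $\sqrt{x(y^2-1)/12}$. A quick sketch in R, function name mine:

```r
# Exact mean and standard deviation of an xdy roll: a fair y-sided die
# has mean (y+1)/2 and variance (y^2-1)/12, and independent dice add.
xdy.moments <- function(x, y) {
  c(mean = x * (y + 1) / 2, sd = sqrt(x * (y^2 - 1) / 12))
}

xdy.moments(3, 6)   # mean 10.5, sd ~2.96 -- close to the sample above
xdy.moments(5, 6)   # mean 17.5, sd ~3.82
xdy.moments(6, 5)   # mean 18.0, sd ~3.46
```

So 5d6 and 6d5 do differ in their moments, just not by much relative to sampling noise.)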

Best Answer

Sure there is.

Let's assume that we could see $k$ different types of dice with $Y=\{y_1, \dots, y_k\}$ pips each. The "classical" D&D dice would yield $Y=\{4, 6, 8, 10, 12, 20, 100\}$, but any other set can be handled the same way. Let $|Y|$ denote the number of different possible dice.

Assume that the minimum observed roll is $m$ and the maximum is $M$. This bounds the possible number of dice involved: each die shows at least 1 pip, so $x\leq m$, and at most $\max Y$ pips, so $x\geq M/\max Y$. Together,

$$\frac{M}{\max Y}\;\leq\; x\;\leq\; m.$$

Now, simulate a large number of rolls from each remaining candidate xdy (by the bounds above, there are at most $m\times |Y|$ different candidates). This number of simulated rolls can be larger than the number of original observations. In each case, create a histogram of the simulated totals.

Finally, compare these density histograms to the density histogram of the original observations. (We work with densities, not frequencies, so the number of simulations need not equal the number of observations.) The earth mover's distance gives a notion of distance between histograms: the simulated histogram with the lowest earth mover's distance to your observed histogram is the one most consistent with your observations.
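For equally spaced one-dimensional bins, the earth mover's distance has a simple closed form: it is the summed absolute difference of the cumulative sums, i.e. how much mass has to travel how far to turn one histogram into the other. A minimal sketch (function name mine):

```r
# 1-D earth mover's distance between two densities on the same equally
# spaced bins: total "mass times distance" needed to turn p into q,
# which equals the summed absolute cumulative differences.
emd.1d <- function(p, q) sum(abs(cumsum(p - q)))

p <- c(0, 1, 0)   # all mass in bin 2
q <- c(0, 0, 1)   # all mass in bin 3
emd.1d(p, q)      # 1: one unit of mass moves one bin over
```

The emdist package used below computes the same kind of quantity for general (including 2-D) histograms.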

Here is some R code, using the emdist package. We simulate 1000 rolls of 3d6 and use the standard D&D dice as candidates $Y$.

set.seed(1)
# 1000 observed rolls of 3d6: 3000 individual d6 results, summed row-wise
obs <- rowSums(matrix(ceiling(runif(3000,0,6)),nrow=1000))
candidates.y <- c(4,6,8,10,12,20,100)   # standard D&D dice

# integer-width bins over the observed range, plus one wide final bin to
# catch simulated totals beyond the observed maximum
breaks <- c(seq(.5,max(obs)+.5,by=1),max(obs)*max(candidates.y))
hist.obs <- hist(obs,breaks=breaks,freq=FALSE)

candidates.x <- 1:max(obs)   # conservative; 1:min(obs) would already suffice

n.sims <- 1000   # simulated rolls per candidate xdy

library(emdist)   # provides emd2d() for the earth mover's distance
emd.dist <- matrix(NA,nrow=length(candidates.x),ncol=length(candidates.y),
    dimnames=list(candidates.x,candidates.y))
for ( ii in seq_along(candidates.x) ) {
    for ( jj in seq_along(candidates.y) ) {
        set.seed(1) # for replicability
        # simulate n.sims rolls of the candidate xdy
        obs.sim <- rowSums(matrix(ceiling(
            runif(candidates.x[ii]*n.sims,0,candidates.y[jj])),
          nrow=n.sims))
        hist.sim <- hist(obs.sim,breaks=breaks,plot=FALSE)
        # distance between the observed and simulated density histograms
        emd.dist[ii,jj] <- emd2d(matrix(hist.obs$density,nrow=1),
          matrix(hist.sim$density,nrow=1))
    }
}
emd.dist

Output:

           4         6         8        10         12        20        100
1  7.9029999 6.8909998 5.8920002 4.8950000 3.88700008 1.9228243 0.39887184
2  5.4299998 3.4460001 1.4499999 0.8147613 1.08018708 0.8099990 0.04417419
3  2.9579999 0.0000000 1.8063331 1.9216927 1.61559486 0.2684011 0.35928789
4  0.6180000 2.4049284 2.2750988 1.3528987 0.71173453 0.2177625 1.00000000
5  1.9768735 2.8317218 1.7545075 0.5033562 0.17459151 0.1226299 1.00000000
6  3.4169219 1.9913402 0.4674550 0.1918677 0.05264558 1.0000000 1.00000000
7  3.3924465 0.9393466 0.2968451 0.1572183 0.21881166 1.0000000 1.00000000
8  2.6134274 0.3687483 0.1226299 1.0000000 1.00000000 1.0000000 1.00000000
9  1.5483801 0.3894975 0.3592879 1.0000000 1.00000000 1.0000000 1.00000000
10 0.3270817 0.2188117 1.0000000 1.0000000 1.00000000 1.0000000 1.00000000
11 0.1052912 0.3592879 1.0000000 1.0000000 1.00000000 1.0000000 1.00000000
12 0.3592879 1.0000000 1.0000000 1.0000000 1.00000000 1.0000000 1.00000000
13 1.0000000 1.0000000 1.0000000 1.0000000 1.00000000 1.0000000 1.00000000
14 1.0000000 1.0000000 1.0000000 1.0000000 1.00000000 1.0000000 1.00000000
15 1.0000000 1.0000000 1.0000000 1.0000000 1.00000000 1.0000000 1.00000000
16 1.0000000 1.0000000 1.0000000 1.0000000 1.00000000 1.0000000 1.00000000
17 1.0000000 1.0000000 1.0000000 1.0000000 1.00000000 1.0000000 1.00000000
18 1.0000000 1.0000000 1.0000000 1.0000000 1.00000000 1.0000000 1.00000000

And look: the 3d6 does indeed have the simulated histogram with the lowest distance to the observed histogram. Hurray! (The distance of exactly 0 is an artifact of reusing set.seed(1), which makes the 3d6 simulation reproduce the observations exactly; with a different seed it would be small but nonzero.)
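To pick the winner out of the matrix programmatically instead of by eye, `which(..., arr.ind = TRUE)` does the job. Here with a small toy matrix; the same two lines applied to the full `emd.dist` above return x = 3, y = 6:

```r
# Toy distance matrix (rows = candidate x, columns = candidate y);
# the same idiom applies unchanged to the full emd.dist computed above.
emd.dist <- matrix(c(2.9, 0.0, 1.8,
                     0.6, 2.4, 2.3),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(c("3", "4"), c("4", "6", "8")))
best <- which(emd.dist == min(emd.dist), arr.ind = TRUE)
c(x = rownames(emd.dist)[best[1, "row"]],
  y = colnames(emd.dist)[best[1, "col"]])   # x = "3", y = "6"
```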

A couple of comments:

  • This works better if you have "many" observations. With few observations, you get a lot of noise, and many different simulated histograms are consistent with your noisy data.
  • Of course, the larger your candidate set $Y$, the lower your chance of recovering the true values.
  • If the true $y$ is not in $Y$, you will at least get something "close to it".
  • The formally-statistically-correct treatment would be a Bayesian one, putting equal prior probabilities on all possible $x$ and $y\in Y$, then calculating or simulating draws and deriving posterior probabilities on pairs $(x,y)$. Yes, this can be done rigorously. Anyone want to have a go?
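To make that last bullet concrete, here is a rough sketch of the Bayesian version (all function and variable names are mine): the exact pmf of a sum of x fair y-sided dice via repeated convolution, a uniform prior over all candidate $(x,y)$ pairs, and the posterior by Bayes' rule:

```r
# Exact pmf of the sum of x fair y-sided dice by repeated convolution.
dice.pmf <- function(x, y) {
  one <- rep(1 / y, y)          # one die: uniform on 1..y
  out <- one
  if (x > 1) for (i in 2:x) out <- convolve(out, rev(one), type = "open")
  out                           # out[k] = P(sum = k + x - 1), support x..x*y
}

# Exact log-likelihood of the observed totals under a candidate xdy.
log.lik <- function(obs, x, y) {
  if (min(obs) < x || max(obs) > x * y) return(-Inf)  # impossible totals
  sum(log(dice.pmf(x, y)[obs - x + 1]))
}

set.seed(1)
obs <- rowSums(matrix(ceiling(runif(3000, 0, 6)), nrow = 1000))  # 3d6
cand <- expand.grid(x = 1:min(obs), y = c(4, 6, 8, 10, 12, 20, 100))
ll <- mapply(function(x, y) log.lik(obs, x, y), cand$x, cand$y)
post <- exp(ll - max(ll)); post <- post / sum(post)   # uniform prior
cand[which.max(post), ]       # MAP estimate; should recover x = 3, y = 6
```

With a uniform prior, the posterior is just the normalized likelihood, so `post` also gives the posterior probability of each candidate pair, not only a point estimate.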