Fake uniform random numbers: more evenly distributed than true uniform data

Tags: distributions, quasi-monte-carlo, random-generation, uniform-distribution

I'm looking for a way to generate random numbers that appear to be uniformly distributed (and that every test will show to be uniform), except that they are more evenly distributed than true uniform data.

The problem I have with "true" uniform randoms is that they will occasionally cluster. This effect is stronger at low sample sizes. Roughly speaking: when I draw two uniform randoms in U[0;1], the chance is around 10% that both fall into the same interval of width 0.1, and 1% that both fall into the same interval of width 0.01.

So I'm looking for a good way to generate random numbers that are more evenly distributed than uniform randoms.

Use case example: say I'm writing a computer game and want to place treasure randomly on a map (not caring about anything else). I don't want the treasure to be all in one place; it should be spread over the map. With uniform randoms, if I place, say, 10 objects, the chance is not that low that 5 or so of them end up really close to each other. This may give one player an advantage over another. Think of Minesweeper: there is a chance (albeit low, if there are enough mines) that you get really lucky and win with a single click.

A very naive approach to my problem is to divide the data into a grid. As long as the sample size is large enough (and has suitable factors), one can enforce extra uniformity this way. So instead of drawing 12 random variables from U[0;1], I can draw 6 from U[0;0.5] and 6 from U[0.5;1], or 4 from U[0;1/3] + 4 from U[1/3;2/3] + 4 from U[2/3;1].
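A minimal sketch of this stratification idea in Python with NumPy (the helper name `stratified_uniform` is my own):

```python
import numpy as np

def stratified_uniform(n_strata, per_stratum, rng=None):
    """Draw per_stratum points from each of n_strata equal-width
    sub-intervals of [0, 1), then shuffle away the stratum order."""
    rng = np.random.default_rng() if rng is None else rng
    # Left edges 0, 1/k, 2/k, ... of the k strata, one per draw.
    edges = np.repeat(np.arange(n_strata) / n_strata, per_stratum)
    samples = edges + rng.random(edges.size) / n_strata
    rng.shuffle(samples)  # so the first draws are not all from stratum 0
    return samples

# 12 draws: 4 from each third of [0, 1), as in the example above.
print(stratified_uniform(3, 4))
```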

Is there any better way to get this extra evenness into the uniform? It probably only works for batches of randoms (when drawing a single random, I obviously have to consider the whole range). Note that I can shuffle the records again afterwards (so it's not the first four that come from the first third).

How about doing it incrementally? So the first point is drawn from U[0;1], then one from each half, then one from each third, then one from each fourth? Has this been investigated, and how good is it?
I might have to be careful to use different generators for x and y so they don't come out correlated (the first (x, y) point would always be in the bottom half, the second in the left half and bottom third, the third in the center third and top third…), so at least some random bin permutation is also needed. And in the long run, it will become too even, I guess.

As a side note, is there a well-known test for whether a sample is too evenly distributed to be truly uniform? That is, testing "true uniform" vs. "someone messed with the data and spread the items more evenly". If I recall correctly, the Hopkins statistic can measure this, but can it be used for testing, too? There is also something of an inverse K-S test: if the largest deviation is below a certain expected threshold, the data is too evenly distributed (see the sketch below).
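To make the "inverse K-S" idea concrete, here is a sketch using SciPy: under the null of true uniformity the K-S statistic has a known distribution, so a suspiciously small statistic (p-value very close to 1 in the usual test) is evidence of super-uniformity.

```python
from scipy import stats

# An overly even "sample": the midpoints of 20 equal bins.
x = [(i + 0.5) / 20 for i in range(20)]

# kstest reports the statistic D_n and the usual p-value P(D >= observed);
# a p-value near 1 means D_n is suspiciously small for a true uniform sample.
d, p = stats.kstest(x, "uniform")
print(d, p)  # D_n = 1/(2n) = 0.025 here, p almost exactly 1
```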

Best Answer

Yes, there are many ways to produce a sequence of numbers that are more evenly distributed than random uniforms. In fact, there is a whole field dedicated to this question; it is the backbone of quasi-Monte Carlo (QMC). Below is a brief tour of the absolute basics.

Measuring uniformity

There are many ways to do this, but the most common way has a strong, intuitive, geometric flavor. Suppose we are concerned with generating $n$ points $x_1,x_2,\ldots,x_n$ in $[0,1]^d$ for some positive integer $d$. Define $$\newcommand{\I}{\mathbf 1} D_n := \sup_{R \in \mathcal R}\,\left|\frac{1}{n}\sum_{i=1}^n \I_{(x_i \in R)} - \mathrm{vol}(R)\right| \>, $$ where $R$ is a rectangle $[a_1, b_1] \times \cdots \times [a_d, b_d]$ in $[0,1]^d$ such that $0 \leq a_i \leq b_i \leq 1$ and $\mathcal R$ is the set of all such rectangles. The first term inside the modulus is the "observed" proportion of points inside $R$ and the second term is the volume of $R$, $\mathrm{vol}(R) = \prod_i (b_i - a_i)$.

The quantity $D_n$ is often called the discrepancy or extreme discrepancy of the set of points $(x_i)$. Intuitively, we find the "worst" rectangle $R$ where the proportion of points deviates the most from what we would expect under perfect uniformity.

This is unwieldy in practice and difficult to compute. For the most part, people prefer to work with the star discrepancy, $$ D_n^\star = \sup_{R \in \mathcal A} \,\left|\frac{1}{n}\sum_{i=1}^n \I_{(x_i \in R)} - \mathrm{vol}(R)\right| \>. $$ The only difference is the set $\mathcal A$ over which the supremum is taken. It is the set of anchored rectangles (at the origin), i.e., where $a_1 = a_2 = \cdots = a_d = 0$.
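In one dimension, however, the star discrepancy of a sample has a simple closed form over the order statistics (it is essentially the Kolmogorov-Smirnov statistic against the uniform CDF). A small sketch:

```python
import numpy as np

def star_discrepancy_1d(points):
    """Exact star discrepancy D_n* of a 1-D point set in [0, 1],
    via the classical formula over the sorted sample."""
    x = np.sort(np.asarray(points, dtype=float))
    n = x.size
    i = np.arange(1, n + 1)
    # The worst anchored interval [0, t) ends just below or above some x_(i).
    return max(np.max(i / n - x), np.max(x - (i - 1) / n))

rng = np.random.default_rng(0)
print(star_discrepancy_1d(rng.random(100)))               # i.i.d. uniform: typically near 0.08
print(star_discrepancy_1d((np.arange(100) + 0.5) / 100))  # even grid: 1/(2n) = 0.005
```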

Lemma: $D_n^\star \leq D_n \leq 2^d D_n^\star$ for all $n$, $d$.
Proof. The left hand bound is obvious since $\mathcal A \subset \mathcal R$. The right-hand bound follows because every $R \in \mathcal R$ can be composed via unions, intersections and complements of no more than $2^d$ anchored rectangles (i.e., in $\mathcal A$).

Thus, we see that $D_n$ and $D_n^\star$ are equivalent in the sense that if one is small as $n$ grows, the other will be too. Here is a (cartoon) picture showing candidate rectangles for each discrepancy.

[Figure: candidate rectangles for the extremal and star discrepancy]

Examples of "good" sequences

Sequences with verifiably low star discrepancy $D_n^\star$ are often called, unsurprisingly, low discrepancy sequences.

van der Corput. This is perhaps the simplest example. For $d=1$, the van der Corput sequences are formed by expanding the integer $i$ in binary and then "reflecting the digits" around the radix point. More formally, this is done with the radical inverse function in base $b$, $$\newcommand{\rinv}{\phi} \rinv_b(i) = \sum_{k=0}^\infty a_k b^{-k-1} \>, $$ where $i = \sum_{k=0}^\infty a_k b^k$ and $a_k$ are the digits in the base $b$ expansion of $i$. This function forms the basis for many other sequences as well. For example, $41$ in binary is $101001$ and so $a_0 = 1$, $a_1 = 0$, $a_2 = 0$, $a_3 = 1$, $a_4 = 0$ and $a_5 = 1$. Hence, the 41st point in the van der Corput sequence is $x_{41} = \rinv_2(41) = 0.100101\,\text{(base 2)} = 37/64$.
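A direct transcription of $\rinv_b$ as a sketch (digits are peeled off least significant first):

```python
def radical_inverse(i, b=2):
    """phi_b(i): reflect the base-b digits of i about the radix point."""
    result, scale = 0.0, 1.0 / b
    while i > 0:
        i, digit = divmod(i, b)   # peel off the next base-b digit a_k
        result += digit * scale   # ... and place it at weight b^(-k-1)
        scale /= b
    return result

print(radical_inverse(41))  # 37/64 = 0.578125, matching the computation above
```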

Note that because the least significant bit of $i$ oscillates between $0$ and $1$, the points $x_i$ for odd $i$ are in $[1/2,1)$, whereas the points $x_i$ for even $i$ are in $(0,1/2)$.

Halton sequences. Among the most popular of classical low-discrepancy sequences, these are extensions of the van der Corput sequence to multiple dimensions. Let $p_j$ be the $j$th smallest prime. Then, the $i$th point $x_i$ of the $d$-dimensional Halton sequence is $$ x_i = (\rinv_{p_1}(i), \rinv_{p_2}(i),\ldots,\rinv_{p_d}(i)) \>. $$ For low $d$ these work quite well, but have problems in higher dimensions.

Halton sequences satisfy $D_n^\star = O(n^{-1} (\log n)^d)$. They are also nice because they are extensible in that the construction of the points does not depend on an a priori choice of the length of the sequence $n$.
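A sketch of the construction follows; for what it's worth, recent SciPy versions also ship a ready-made generator as `scipy.stats.qmc.Halton`.

```python
import numpy as np

def radical_inverse(i, b):
    """phi_b(i), as in the van der Corput sketch above."""
    result, scale = 0.0, 1.0 / b
    while i > 0:
        i, digit = divmod(i, b)
        result += digit * scale
        scale /= b
    return result

def halton(n, d, primes=(2, 3, 5, 7, 11, 13)):
    """First n points of the d-dimensional Halton sequence,
    one prime base per coordinate (enough bases here for d <= 6)."""
    return np.array([[radical_inverse(i, primes[j]) for j in range(d)]
                     for i in range(1, n + 1)])

print(halton(5, 2))
```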

Hammersley sequences. This is a very simple modification of the Halton sequence. We instead use $$ x_i = (i/n, \rinv_{p_1}(i), \rinv_{p_2}(i),\ldots,\rinv_{p_{d-1}}(i)) \>. $$ Perhaps surprisingly, the advantage is that they have better star discrepancy $D_n^\star = O(n^{-1}(\log n)^{d-1})$.
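The corresponding sketch, reusing `radical_inverse` and the prime bases from the Halton snippet above:

```python
import numpy as np

# Reuses radical_inverse from the Halton sketch above.
def hammersley(n, d, primes=(2, 3, 5, 7, 11)):
    """The n-point, d-dimensional Hammersley set: coordinate 0 is i/n,
    the rest are Halton coordinates. Note n must be fixed up front,
    so unlike Halton the construction is not extensible."""
    return np.array([[i / n] + [radical_inverse(i, primes[j])
                                for j in range(d - 1)]
                     for i in range(n)])
```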

Here is an example of the Halton and Hammersley sequences in two dimensions.

[Figure: Halton and Hammersley sequences in two dimensions]

Faure-permuted Halton sequences. A special set of permutations (fixed as a function of $i$) can be applied to the digit expansion $a_k$ for each $i$ when producing the Halton sequence. This helps remedy (to some degree) the problems alluded to in higher dimensions. Each of the permutations has the interesting property of keeping $0$ and $b-1$ as fixed points.

Lattice rules. Let $\beta_1, \ldots, \beta_{d-1}$ be integers. Take $$ x_i = (i/n, \{i \beta_1 / n\}, \ldots, \{i \beta_{d-1}/n\}) \>, $$ where $\{y\}$ denotes the fractional part of $y$. Judicious choice of the $\beta$ values yields good uniformity properties. Poor choices can lead to bad sequences. They are also not extensible. Here are two examples.
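A sketch of such a rank-1 lattice in NumPy; a classically good 2-D choice takes $n$ and $\beta$ to be consecutive Fibonacci numbers (the Fibonacci lattice), while, say, $\beta = 1$ degenerates to points on the diagonal:

```python
import numpy as np

def rank1_lattice(n, betas):
    """Rank-1 lattice rule: x_i = ({i/n}, {i*b_1/n}, ..., {i*b_{d-1}/n})."""
    i = np.arange(n)[:, None]
    gens = np.array([1, *betas])
    return (i * gens) / n % 1.0

good = rank1_lattice(89, [55])  # Fibonacci lattice: n = F_11, beta = F_10
bad = rank1_lattice(89, [1])    # all points fall on the main diagonal
```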

[Figure: good and bad lattices]

$(t,m,s)$ nets. A $(t,m,s)$ net in base $b$ is a set of $b^m$ points in $[0,1]^s$ such that every elementary interval (a box whose $j$th side is of the form $[a_j b^{-k_j}, (a_j+1) b^{-k_j})$) of volume $b^{t-m}$ contains exactly $b^t$ points. This is a strong form of uniformity. Small $t$ is your friend, in this case. Sobol' and Faure sequences are examples of constructions yielding $(t,m,s)$ nets. These lend themselves nicely to randomization via scrambling. Random scrambling (done right) of a $(t,m,s)$ net yields another $(t,m,s)$ net. The MinT project keeps a collection of such sequences.
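In practice one rarely codes digital nets by hand; for instance, recent SciPy versions (>= 1.7, an assumption about your environment) expose a scrambled Sobol' generator in `scipy.stats.qmc`:

```python
from scipy.stats import qmc

# Scrambled Sobol' points in [0, 1)^2. Powers of two preserve the
# net's balance properties, hence random_base2 rather than random.
sobol = qmc.Sobol(d=2, scramble=True, seed=42)
points = sobol.random_base2(m=8)  # 2**8 = 256 points
print(qmc.discrepancy(points))    # centered L2-discrepancy (small is good)
```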

Simple randomization: Cranley-Patterson rotations. Let $x_i \in [0,1]^d$ be a sequence of points. Let $U \sim \mathcal U([0,1]^d)$ be a single uniform draw. Then the points $\hat x_i = \{x_i + U\}$, with the fractional part taken componentwise, are each uniformly distributed in $[0,1]^d$.
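A sketch of the rotation (one uniform shift per coordinate, applied modulo 1):

```python
import numpy as np

def cranley_patterson(points, rng=None):
    """Randomly rotate a point set modulo 1: add one U[0,1) shift per
    axis and wrap. Each rotated point is marginally uniform on [0,1)^d."""
    rng = np.random.default_rng() if rng is None else rng
    shift = rng.random(points.shape[1])  # one shift per dimension
    return (points + shift) % 1.0
```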

Here is an example with the blue dots being the original points and the red dots being the rotated ones with lines connecting them (and shown wrapped around, where appropriate).

[Figure: Cranley-Patterson rotation]

Completely uniformly distributed sequences. This is an even stronger notion of uniformity that sometimes comes into play. Let $(u_i)$ be a sequence of points in $[0,1]$ and now form overlapping blocks of size $s$ to get the sequence $(x_i)$. So, if $s = 3$, we take $x_1 = (u_1,u_2,u_3)$, then $x_2 = (u_2,u_3,u_4)$, etc. If, for every $s \geq 1$, $D_n^\star(x_1,\ldots,x_n) \to 0$, then $(u_i)$ is said to be completely uniformly distributed. In other words, the sequence yields a set of points of any dimension that have desirable $D_n^\star$ properties.
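Forming the overlapping blocks is a one-liner with NumPy's sliding windows (available in NumPy >= 1.20):

```python
import numpy as np

def overlapping_blocks(u, s):
    """All overlapping s-blocks x_i = (u_i, ..., u_{i+s-1}) of a sequence."""
    return np.lib.stride_tricks.sliding_window_view(np.asarray(u), s)

u = np.arange(1, 7) / 10          # u_1, ..., u_6 (toy values)
print(overlapping_blocks(u, 3))   # rows: (u_1,u_2,u_3), (u_2,u_3,u_4), ...
```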

As an example, the van der Corput sequence is not completely uniformly distributed since for $s = 2$, the points $x_{2i}$ are in the square $(0,1/2) \times [1/2,1)$ and the points $x_{2i-1}$ are in $[1/2,1) \times (0,1/2)$. Hence there are no points in the square $(0,1/2) \times (0,1/2)$ which implies that for $s=2$, $D_n^\star \geq 1/4$ for all $n$.

Standard references

The Niederreiter (1992) monograph, Random Number Generation and Quasi-Monte Carlo Methods (SIAM), and the Fang and Wang (1994) text, Number-Theoretic Methods in Statistics (Chapman & Hall), are places to go for further exploration.
