Solved – Verifying that a random generator outputs a uniform distribution

distributionshistogramhypothesis testingprobabilityuniform distribution

I asked a student of mine this question:

If you have a random number generator that outputs a number between
$1$ and $k$, how would you write a test that decides whether the
generated distribution is uniform ?

The student understood that we need to call the generator $n$ times and save a histogram $h_i$.

However I would expect the end test to be:
$$\sum{(h_i-\frac{n}{k})^2}<t$$
Where $t$ is a threshold that depends on $n$ and $k$

And what the student suggested is
$$1-t<\frac{h_i}{n/k}<1+t$$
Again, for some $t$ that is a function of $n$ and $k$

Is the student correct ?
Are those two ways equivalent ?

Best Answer

UPDATE 2: striked wrong points, and replaced some by stuff I think are correct.

UPDATE 1: added direct analysis of both your uniformity tests, and kept my old answer as a proposed superior uniformity test that I suggest.

Summary

Both you and your student have offered uniformity tests that work.
Both of your tests don't work well with continuous numbers (i.e. if RNG spits out numbers in $[1,k]$ instead of $\{1,2,\ldots,k\}$.
Since the output of your uniformity score is not a probability, I think it has the disadvantage of being difficult to interpret.
~~Since your student's uniformity score is a probability, I think it has the advantage of being easy to interpret.~~
While not harmful to the correctness of the test, your student's uniformity test has one aspect that is useless (but also harmless; just redundant), namely: there is not point to add $t$ on right side of the inequality as the output is a probability.
~~One small error in your student's uniformity test is using $<$ instead of $\le$ for the upper bound. I think he/she should have used $\le$.~~

On the other hand, I think my proposed test is:

Unlike your tests, mine works with continuous numbers.
Similar to your student's test, mine is also easy to interpret cause it's a probability as well.

The disadvantage is that my test is possibly harder to compute. But this is not a big deal as I think the correctness of the test is worth the slight increase in difficulty.

Analysing your uniformity tests

Your notation confused me a bit. I assume that you mean that $h$ is a histogram, and $h_i$ is the frequency that is associated with input $i$ where $i$ is some number between $1$ and $k$ (which your RNG produces).

So to rewrite your test, I think you wanted to say this: \begin{equation} \text{teacher uniformity score} = \sum_{i \in \{1,2,\ldots,k\}} (h_i - \frac{n}{k})^2 \end{equation}

Then the uniformity test: the RNG is uniformly distributed if the uniformity score is less than some threshold $t_{teacher}$ that we agree upon. I.e. output from RNG is uniformly distributed if the following statement is true: \begin{equation} \text{teacher uniformity score} < t_{teacher} \end{equation}

I see why your method makes sense. Basically, your score is essentially the sum of squared errors of observed frequencies against expected frequencies, where expected is what should happen if the RNG is perfectly uniform. So if the sum of squared errors is minimal, the more uniform it is.

So my opinion about your test is this:

Pros:
- it does have merit, and it does reflect the degree of uniformity of the RNG.
Cons:
- it is not easy to interpret because the output is not a probability. If you augment it such that the output is a probability, then it would be easier to interpret. At least I can't interpret it easily. This interpret-ability is a subjective thing.
- it is not friendly with continuous RNGs. You can't use histograms with continuous numbers without approximating numbers into bins, which essentially deletes information and opens the vulnerability that might potentially lead to masking non-uniformity of some poor RNGs. We need to look at the probability density functions (PDFs) instead.

What you wrote about what your student claimed is also confusing, but I rewrite that to what I think is the closest thing that makes best sense in my view (kindly correct me if you think the student didn't say this): \begin{equation} \text{student uniformity score} = \sum_{i \in \{1,2,\ldots,k\}} \frac{h_i}{n/k} \end{equation}

This is essentially the ratio between observed histogram and expected histogram (expected is the perfect uniform one). Clearly, if the ratio is 1 then it's perfectly uniform. But if it isn't (i.e. greater or less than 1) then it is not perfectly uniform.

So your student is suggesting that the closer the ratio is to 1, the more random it is. So $t_{student}$ is essentially some kind of error term that accounts for deviations from 1: \begin{equation} 1-t_{student} \le \text{student uniformity score} \le 1+t_{student} \end{equation}

My opinion about your student's uniformity test is as follows:

Pros:
- It does have merit for the same reason yours does.
- ~~It is easy to interpret (thanks to it being a probability). This is subjective though. I think most humans agree with me (correct me if you disagree).~~
Cons:
- It's not easy to interpret just like yours.
- It doesn't work well with continuous RNGs for the same reason yours doesn't.

My test which I think is superior to both of your tests

Based on the question, you only seem to be interested in finding whether a sequence is uniformly distributed. I.e. it doesn't need to satisfy any other property.

If your PRNG is outputting some float, e.g. something in $[0,1]$, I would personally suggest doing this (which seems somehow similar to your student's suggest? I don't know..):

Repeat PRNG $n$ times, where $n$ is large enough.
Estimate the true probability density function $f_X$. Let $\hat f_X$ be your best estimation.

Then your sequence is perfectly uniformly distributed if $\hat f_X(x)$ forms a line with a slope of 0. This is usually unachievable in reality.

So you need to define a degree of uniformity that, if satisfied, you subjectively declare that the PRNG is uniformly distributed.

To do this, I would personally suggest to define the null hypothesis:

A sequence is uniform if the probability of each number to appear is $1/n$.

You then compute the probability based on your empirical trials.

Finally, you do some variant of Fisher's exact test to find the probability that your sequence can exist if the null hypothesis is true.

Once you get that probability, it is there where you plug your threshold. You and your student need to agree on this threshold. It's subjective.

In this specific approach, I'd expect you to agree on a very large probability that is very close to $1$. Maybe $0.999$, or whatever you and your student seem to be happy with.

Or maybe do the opposite: define the null hypothesis as:

Your sequence is not uniform, and probability of each number to appear is not $1/n$.

Then, you need to choose a a very small threshold in order to reject it.

Related Solutions

Solved – Test the randomness (uniformly distributed) on a 64 bit float random generator

As it stands, this is not a good way to test whether floating point numbers are uniformly distributed. Like Aksakal, I wondered about whether the bits of the exponent part of the floating point representation would be uniformly distributed. The answer to this is that they aren't uniformly distributed, because there are very many more numbers with large exponents than there are numbers with small exponents.

I wrote a small test program that confirms this. It generates $N = 1 \text{ million}$ uniformly distributed random floating point numbers, and as a control, $N$ random integers. (There were various problems generating 64 bit floating point numbers, see e.g. here, and 32 bits seems sufficient for demonstration purposes.)

First, the control case. The plot of the bins of bits for integers is just as you suggested, with each bin $\approx N/2$. enter image description here

Now for the floating point numbers. A plot of the sorted numbers is a straight line, indicating that they would pass the Kolmogorov–Smirnov test for uniformity. enter image description here

But the bins are definitely not uniform. enter image description here

If you plot only bins 1 to 23 together with bin 32, you do get bins $\approx N/2$, but bins 24 to 31 show a clear increasing pattern. These bits corresponds precisely with the bits for the exponent in 32 bit floating point numbers. The IEEE single precision floating point definition stipulates

the least significant 23 bits are for the mantissa
the next 8 bits are for the exponent
the most significant bit is for the sign

Another way to see this is to consider a simpler example. Think about generating numbers in base 10 between 0 and $10^7$, with a base 10 exponent. Numbers between 0 and 1 would have an exponent of 0. Numbers between 1 and 10 would have an exponent of 1, numbers between 10 and 100 an exponent of 2, ..., and numbers between $10^6$ and $10^7$ an exponent of 7. The numbers $10^4$ to $10^7$ are $(10^7-10^4)/10^7=99.9\%$ of the range and in binary their exponents range from 001 to 111, so you'd expect the most significant bit to occur 99.9% of the time, not 50% of the time.

It would be possible, with some care, to use an approach like this to get the expected frequencies for each bin in the binary exponent of a floating point number, and use this in a $\chi^2$ test, but Kolmogorov–Smirnov is a better approach in theory and easy to implement. Nevertheless a test like this could pick up distributional biases in the implementation of a random number generation that Kolmogorov–Smirnov might not. For example, when I first tried generating 64 bit double precision floating point random numbers in C++, I forgot to change to a 64 bit Mersenne Twister engine. The sorted numbers give a straight line plot, but you can see from the plots of the bins of the bits that the 64 bit Mersenne Twister engine is superior to the 32 bit one (as you would expect).

enter image description here

(Note in both cases that the last bit, the sign bit, is zero, due to the difficulties of generating random numbers across the whole range.)

Solved – Generate random numbers from “sloped uniform distribution” from mathematical theory

Your deriviation is ok. Note that to get a positive density on $(0,B)$, you have to constraint $$ B^2 \tan\phi < 2. $$ In your code $B = 25$ so you should take $\phi$ between $\pm\tan^{-1}{2\over 625}$, that's where your code fails.

You can (and should) avoid using a quadratic solver, and then select the roots between 0 and $B$. The quadratic polynomial equation in $x$ to be solved is $$F(x) = t$$ with $$ F(x) = {1\over 2} \tan \phi \cdot x^2 + \left( {1\over B} - {B\over 2} \tan \phi \right) x.$$ By construction $F(0) = 0$ and $F(B) = 1$; also $F$ increases on $(0,B)$.

From this it is easy to see that if $\tan \phi > 0$, the portion of parabola in which we are interested is a part of the right side of the parabola, and the root to keep is the highest of the two roots, that is $$ x = {1\over \tan \phi} \left( {B\over 2} \tan \phi - {1\over B} + \sqrt{ \left( {B\over 2} \tan \phi - {1\over B} \right)^2 + 2 \tan \phi \cdot t}. \right)$$ To the contrary, if $\tan\phi < 0$, the parabola is upside down, and we are interest in its left part. The root to keep is the lowest one. Taking into account the sign of $\tan\phi$ it appears that this is the same root (ie the one with $+\sqrt\Delta$) than in the first case.

Here is some R code.

phi <- pi/8; B <- 2
f <- function(t) (-(1/B - 0.5*B*tan(phi)) + 
       sqrt( (1/B - 0.5*B*tan(phi))**2 + 2 * tan(phi) * t))/tan(phi)
hist(f(runif(1e6)))

And with $\phi < 0$:

phi <- -pi/8
hist(f(runif(1e6)))