Set of random numbers with uniform distribution – justify the distribution of the differences

probabilityprobability distributionsstatisticsuniform distribution

Make a set of random integers, uniformly distributed between 0 and n=10^4. The size of the set is n^2.

import pandas as pd
import numpy as np
n = 10 ** 4
seq = pd.DataFrame(np.random.randint(0, n, n ** 2))

Plot the histogram and indeed the distribution of values is uniform:

seq.hist(bins=n)

Now, from each element k in the set, subtract the element k-1 and keep the absolute value of the difference. Drop the first element since there's no previous element to that.

diff = seq.diff().dropna().abs().astype(int)

The result is also a set of random numbers with values between 0 and n. But what is the histogram of this new set? Let's see:

diff.hist(bins=n)

Why is the new distribution skewed so much towards small values? Intuitively I expected a bell curve, maybe centered on the average.

What this distribution says is – the most likely difference between any k-1 and k is 0.

Can you provide an intuitive explanation?

Best Answer

Here is an intuitive explanation:

Let's call the two consecutive samples $X, Y$. Since $X$ and $Y$ are drawn independently and from a uniform distribution, we have $P((X,Y) = (i,j)) = P(X=i)P(Y=j) = \frac{1}{(n+1)^2}$ fixed for all pairs $0 \leq i,j \leq n$. So, for a fixed $z \in \{0,1,...,n\}$, $$P(|X-Y| = z)=\sum_{\substack{0 \leq i,j \leq n \\ |i-j|=z}} P((X,Y) = (i,j))$$ boils down to counting how many different pair differences leads to the given $z$ value, since all pairs are equally likely.

For $z=n$, there are only two pairs: $(n,0)$ and $(0,n)$. For $z=n-1$, there are four pairs: $(n, 1), (n-1, 0), (1, n)$ and $(0, n-1)$. And so on.

If you work out the number of such pairs using simple combinatorics, you'll see that moving from $z=n$ to $z=1$, the number of pairs are increasing linearly, which matches what you are seeing at your last plot.

Best Answer

Related Solutions

[Math] the proper way to compare discrete set of data to a continuous probability distribution

[Math] Distribution of the sum of many lognormal random numbers from same distribution

Related Question