Buckets of random numbers

Tags: probability, random

Let's imagine that there are 1000 buckets, and a collection of 100000 items, each assigned a random number between 1 and 1000.

Let's also suppose the distribution of random numbers is considered "truly random", i.e. every number has exactly the same probability of appearing (it probably has a name, I just don't know it) (edit: as mentioned by @Arthur in the comments, this is called "uniformly random").

Now, obviously, with 100000 numbers distributed over 1000 buckets, I expect approximately 100 items per bucket.

But I can't expect to have exactly 100 items in each and every bucket; that would be too regular. A truly random process will necessarily generate some variation, with some buckets as low as 90 and others as high as 110, for example.

Now, I want to find a "threshold" above which one should be worried about the "randomness" of the distribution. Say, if one bucket gets 500 items, that's probably a big warning that the distribution is unlikely to be random (with some probability associated). Alternatively, or as a complement, I could look at the "smallest largest" value that is acceptable: if the largest bucket has only 101 items, the distribution is probably too regular to be uniformly random.

The question is: what is an "acceptable" range of variation that proves/suggests that the distribution is truly random?

I've searched for an answer to this question many times, but my google-fu is not good enough, presumably because I don't know (and hence don't use) the right names and keywords for these concepts.

Best Answer

Like Arthur said, the thing you're looking for is a $p$-value; the apparatus that will give it is a hypothesis test.
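
To make that concrete (this is the standard definition, not something specific to this problem): if $T$ is a test statistic, say the largest bucket count, and $t_{\text{obs}}$ is the value you actually observed, then the $p$-value is the probability of seeing something at least that extreme when the items really are assigned uniformly at random:
$$p = \Pr(T \ge t_{\text{obs}} \mid H_0),$$
where $H_0$ denotes the hypothesis of uniform randomness. A small $p$-value means your observation would be very unusual under uniform randomness, which is grounds for suspicion.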

There are lots of ways you can do this. The chi-squared test, suggested in the comments, is a good place to start; see also the multinomial test. These are good tools for measuring the overall "randomness" of a sample, and they give basic ways to tell whether an observed sample was or was not likely to have come from a purported distribution. However, I'll admit that there's a fairly high barrier to entry on these articles, and they're not necessarily great introductions to the subject of hypothesis testing.
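
For instance, here's a minimal sketch of the chi-squared route in R (the counts below are simulated just so there's something to test; with real data you'd plug in your actual bucket counts):

counts <- tabulate(sample(1:1000, 100000, replace = TRUE), nbins = 1000)  # simulated bucket counts
chisq.test(counts)  # by default, tests the counts against equal expected probabilities

A small $p$-value in the output (below $5\%$, say) would suggest the counts are unlikely to have come from a uniform assignment.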

There's another route to get what you're asking for, and it may well be simpler if you're not already well-versed in the language of hypothesis tests. If you want to focus specifically on the maximum number of objects in the same bucket, for instance, all you need to know is how that statistic typically behaves under the scenario you outlined above (distributing the items randomly among the buckets). That can be monstrously complicated to deal with theoretically, but it's not too hard to write some code to simulate the phenomenon. The basic idea is to code up the quantity you're interested in, repeat the experiment many, many times, and observe what happens across the simulations.

The process I'll outline is a simulation-based (Monte Carlo) approach in the spirit of bootstrapping, and it's a great candidate for this problem because (1) the problem is conceptually very simple, but (2) it's analytically very tedious to work with directly. Here's some pseudocode for what the simulation might look like if you want to investigate the biggest bucket:

num_of_trials = 555555555         # repeat this process many times (see * below)
max_bucket[1:num_of_trials] = 0   # vector of all 0's to record the max bucket size from each trial
for i = 1 to num_of_trials
{
  buckets[1:1000] = 0             # vector of all 0's, of length 1000
  for j = 1 to 100000             # j iterates over the 100000 items
  {
    draw = random(1, 1000)        # a uniform random integer between 1 and 1000
    buckets[draw]++               # increment the buckets vector at the position indicated by "draw"
  }
  max_bucket[i] = max(buckets)    # record the highest bucket occupancy for this trial
}

At the end, you'll have a vector (max_bucket) containing the largest bucket size observed in each trial. You could do a lot with this, such as making a histogram to see how the maximum bucket size is typically distributed. More importantly, you could find a threshold: establish what value constitutes the upper 5% of these simulated maxima, and declare that any observed max at that value or higher is too extreme. Note that 5% is an arbitrary threshold, but it's also a common convention for these purposes.

Note the power of this approach: you could trivially modify it to investigate the smallest bucket, the gap between the largest and smallest buckets, or anything else that's easy to describe mathematically, as sketched below. This method gives you a quick way to focus on whatever quantity you care about and determine whether an observed value is beyond the pale for typical results.
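
As a hedged sketch of that flexibility (in R; tabulate is used so that empty buckets still count as zeros, and the variable names are my own):

num_of_trials <- 10000                 # tune to your patience
min_bucket <- numeric(num_of_trials)   # smallest bucket occupancy per trial
spread     <- numeric(num_of_trials)   # gap between the largest and smallest bucket per trial
for (i in 1:num_of_trials) {
  buckets       <- tabulate(sample(1:1000, 100000, replace = TRUE), nbins = 1000)
  min_bucket[i] <- min(buckets)
  spread[i]     <- max(buckets) - min(buckets)
}
quantile(min_bucket, 0.05)   # an observed minimum below this is unusually small
quantile(spread, 0.95)       # an observed spread above this is unusually large
quantile(spread, 0.05)       # ...and a spread below this looks "too regular"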

*Technical note: In reality, running 555,555,555 trials is probably far too many for a typical computer; you'd want to start with something smaller and work up to whatever program runtime is tolerable for you. More trials is better, but you'll run into long runtimes (and potentially memory errors) if you use too many. One quick way to check whether you have enough trials is to run the whole process twice and see if the two sets of results look terribly different from one another. If they do, you probably need more trials.
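
In code, that stability check might look like this (a hedged sketch; run_trials is a hypothetical wrapper around the simulation loop above):

run_trials <- function(n) {
  maxes <- numeric(n)
  for (i in 1:n) {
    maxes[i] <- max(tabulate(sample(1:1000, 100000, replace = TRUE), nbins = 1000))
  }
  maxes
}
quantile(run_trials(2000), 0.95)   # run the whole process once...
quantile(run_trials(2000), 0.95)   # ...and again; similar cutoffs suggest enough trials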


Edit: Since Cyan was already familiar with simulation approaches, my post above was neither a full answer nor particularly helpful on its own. In hopes of completing it somewhat, here's some working R code:

num_of_trials <- 10000
max_values <- numeric(num_of_trials)   # will hold the largest bucket count from each trial
for (i in 1:num_of_trials) {
  # Drop 100000 items into 1000 buckets uniformly at random,
  # then record the occupancy of the fullest bucket.
  max_values[i] <- max(table(sample(1:1000, 100000, replace = TRUE)))
}
hist(max_values)                       # the simulated distribution of the maximum
quantile(max_values, 0.95)             # the upper-5% cutoff

Output: [histogram of max_values]

The 95th percentile was 141, so a single bucket of size $\fbox{142 or higher}$ would raise suspicion if we set our significance level at the common value of $5\%$.

NB: I know that for() loops are frowned upon in R, but when I attempted to do this as a single vectorized operation, the resulting matrix was too large to store in memory. The runtime for this code on my average machine was about 6 minutes.
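
If the runtime bothers you, one hedged tweak (my suggestion, not part of the original answer): the for() loop itself is not the bottleneck so much as table(), which coerces its input to a factor and builds named counts. Replacing it with tabulate(), which returns plain integer counts for bins 1 through nbins, typically speeds things up considerably without changing the result:

for (i in 1:num_of_trials) {
  max_values[i] <- max(tabulate(sample(1:1000, 100000, replace = TRUE), nbins = 1000))
}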