Buckets of random numbers

Tags: probability, random

Let's imagine that there are 1000 buckets, and a collection of 100000 items, each assigned a random number between 1 and 1000.

Let's also suppose the distribution of random numbers is considered "truly random", i.e. every number has exactly the same probability of appearing (it probably has a name, I just don't know it) (edit: as mentioned by @Arthur in the comments, this is called "uniformly random").

Now, obviously, with 100000 numbers distributed over 1000 buckets, I expect approximately 100 items per bucket.

But I can't expect to have exactly 100 items in each and every bucket; that would be too regular. A truly random process will necessarily generate some variation, with some buckets as low as 90 and others as high as 110, for example.

Now, I want to find a "threshold" above which one should be worried about the "randomness" of the distribution. Say, if one bucket gets 500 items, that's probably a big warning that the distribution is unlikely to be random (with some probability associated). Alternatively, or as a complement, I could look at the "smallest largest" value that is acceptable: if the largest bucket has only 101 items, the distribution is probably too regular to be uniformly random.

The question is: what is an "acceptable" range of variation that proves/suggests that the distribution is truly random?

I've searched for an answer to this question many times, but my google-fu is not good enough, presumably because I don't know (and hence don't use) the right names and keywords for these concepts.

Best Answer

Like Arthur said, the thing you're looking for is a $p$-value; the apparatus that will give it is a hypothesis test.
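
To make that concrete (this is the standard definition, not something specific to this problem): if $T$ is a test statistic, say the largest bucket count, and $t_{\text{obs}}$ is the value you actually observed, then the $p$-value is the probability of seeing something at least that extreme when the items really are assigned uniformly at random:
$$p = \Pr(T \ge t_{\text{obs}} \mid H_0),$$
where $H_0$ denotes the hypothesis of uniform randomness. A small $p$-value means your observation would be very unusual under uniform randomness, which is grounds for suspicion.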

There are lots of ways you can do this. The chi-squared test, suggested in the comments, is a good place to start; see also the multinomial test. These are good tools for measuring the overall "randomness" of a sample, and they give basic ways to tell whether an observed sample was or was not likely to have come from a purported distribution. However, I'll admit that there's a fairly high barrier to entry on these articles, and they're not necessarily great introductions to the subject of hypothesis testing.
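
For instance, here's a minimal sketch of the chi-squared route in R (the counts below are simulated just so there's something to test; with real data you'd plug in your actual bucket counts):

counts <- tabulate(sample(1:1000, 100000, replace = TRUE), nbins = 1000)  # simulated bucket counts
chisq.test(counts)  # by default, tests the counts against equal expected probabilities

A small $p$-value in the output (below $5\%$, say) would suggest the counts are unlikely to have come from a uniform assignment.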

There's another route to get what you're asking for, and it may well be simpler if you're not already well-versed in the language of hypothesis tests. If you want to focus specifically on the maximum number of objects in the same bucket, for instance, all you need to know is how that statistic typically behaves under the scenario you outlined above (distributing the items randomly among the buckets). That can be monstrously complicated to deal with theoretically, but it's not too hard to write some code to simulate the phenomenon. The basic idea is to code up the quantity you're interested in, repeat the experiment many, many times, and observe what happens across the simulations.

The process I'll outline is a simulation-based (Monte Carlo) approach in the spirit of bootstrapping, and it's a great candidate for this problem because (1) the problem is conceptually very simple, but (2) it's analytically very tedious to work with directly. Here's some pseudocode for what the simulation might look like if you want to investigate the biggest bucket:

num_of_trials = 555555555         # repeat this process many times (see * below)
max_bucket[1:num_of_trials] = 0   # vector of all 0's to record the max bucket size from each trial
for i = 1 to num_of_trials
{
  buckets[1:1000] = 0             # vector of all 0's, of length 1000
  for j = 1 to 100000             # j iterates over the 100000 items
  {
    draw = random(1, 1000)        # a uniform random integer between 1 and 1000
    buckets[draw]++               # increment the buckets vector at the position indicated by "draw"
  }
  max_bucket[i] = max(buckets)    # record the highest bucket occupancy for this trial
}

At the end, you'll have a vector (max_bucket) containing the largest bucket size observed in each trial. You could do a lot with this, such as making a histogram to see how the maximum bucket size is typically distributed. More importantly, you could find a threshold: establish what value constitutes the upper 5% of these simulated maxima, and declare that any observed max at that value or higher is too extreme. Note that 5% is an arbitrary threshold, but it's also a common convention for these purposes.

Note the power of this approach: you could trivially modify it to investigate the smallest bucket, the gap between the largest and smallest buckets, or anything else that's easy to describe mathematically, as sketched below. This method gives you a quick way to focus on whatever quantity you care about and determine whether an observed value is beyond the pale for typical results.
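
As a hedged sketch of that flexibility (in R; tabulate is used so that empty buckets still count as zeros, and the variable names are my own):

num_of_trials <- 10000                 # tune to your patience
min_bucket <- numeric(num_of_trials)   # smallest bucket occupancy per trial
spread     <- numeric(num_of_trials)   # gap between the largest and smallest bucket per trial
for (i in 1:num_of_trials) {
  buckets       <- tabulate(sample(1:1000, 100000, replace = TRUE), nbins = 1000)
  min_bucket[i] <- min(buckets)
  spread[i]     <- max(buckets) - min(buckets)
}
quantile(min_bucket, 0.05)   # an observed minimum below this is unusually small
quantile(spread, 0.95)       # an observed spread above this is unusually large
quantile(spread, 0.05)       # ...and a spread below this looks "too regular"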

*Technical note: In reality, running 555,555,555 trials is probably far too many for a typical computer; you'd want to start with something smaller and work up to whatever program runtime is tolerable for you. More trials is better, but you'll run into long runtimes (and potentially memory errors) if you use too many. One quick way to check whether you have enough trials is to run the whole process twice and see if the two sets of results look terribly different from one another. If they do, you probably need more trials.
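
In code, that stability check might look like this (a hedged sketch; run_trials is a hypothetical wrapper around the simulation loop above):

run_trials <- function(n) {
  maxes <- numeric(n)
  for (i in 1:n) {
    maxes[i] <- max(tabulate(sample(1:1000, 100000, replace = TRUE), nbins = 1000))
  }
  maxes
}
quantile(run_trials(2000), 0.95)   # run the whole process once...
quantile(run_trials(2000), 0.95)   # ...and again; similar cutoffs suggest enough trials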


Edit: Since Cyan was already familiar with simulation approaches, my post above was neither a full answer nor particularly helpful on its own. In hopes of completing it somewhat, here's some working R code:

num_of_trials <- 10000
max_values <- numeric(num_of_trials)   # will hold the largest bucket count from each trial
for (i in 1:num_of_trials) {
  # Drop 100000 items into 1000 buckets uniformly at random,
  # then record the occupancy of the fullest bucket.
  max_values[i] <- max(table(sample(1:1000, 100000, replace = TRUE)))
}
hist(max_values)                       # the simulated distribution of the maximum
quantile(max_values, 0.95)             # the upper-5% cutoff

Output: [histogram of max_values]

The 95th percentile was 141, so a single bucket of size $\fbox{142 or higher}$ would raise suspicion if we set our significance level at the common value of $5\%$.

NB: I know that for() loops are frowned upon in R, but when I attempted to do this as a single vectorized operation, the resulting matrix was too large to store in memory. The runtime for this code on my average machine was about 6 minutes.
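
If the runtime bothers you, one hedged tweak (my suggestion, not part of the original answer): the for() loop itself is not the bottleneck so much as table(), which coerces its input to a factor and builds named counts. Replacing it with tabulate(), which returns plain integer counts for bins 1 through nbins, typically speeds things up considerably without changing the result:

for (i in 1:num_of_trials) {
  max_values[i] <- max(tabulate(sample(1:1000, 100000, replace = TRUE), nbins = 1000))
}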