To distribute the number $n$ into $p$ parts as evenly as possible, compute the truncating integer division of $n$ by $p$, together with the corresponding remainder.
Mathematically that is (assuming that both $n$ and $p$ are strictly
positive integers)
$$
d = \left\lfloor \frac np \right\rfloor \, , \quad r = n \bmod p = n-pd \, .
$$
In many programming languages you would do something like
int d = n / p; // truncating integer division
int r = n % p; // remainder
Then $$n = pd + r = r(d+1) + (p-r)d $$
so that the desired partition is
$$
\underbrace{d+1, \ldots, d+1}_{r \text{ times}}, \underbrace{d, \ldots, d}_{p-r \text{ times}}
$$
Like Arthur said, the thing you're looking for is a $p$-value; the apparatus that will give it is a hypothesis test.
There are lots of ways that you can do this. The chi-squared test, suggested in the comments, is a good place to start; see also the multinomial test. These are good tools to measure the overall "randomness" of a sample and to tell whether an observed sample was or was not likely to have come from a purported distribution. However, I'll admit that there's a fairly high barrier to entry on these articles, and they're not necessarily great introductions to the subject of hypothesis testing.
There's another workaround to get what you're asking for, and it may very well be simpler if you're not already well-versed in the language of hypothesis tests. If you want to focus specifically on the maximum number of objects in the same bucket, for instance, all you need to know is how that quantity typically looks under the scenario you outlined above (distributing the items randomly to each bucket). That can be monstrously complicated to deal with theoretically, but it's not too hard to write some code to simulate the phenomenon. The basic idea is to code up the quantity you're interested in, repeat the experiment many, many times, and observe what happens across the many different simulations.
The process I'll outline is an instance of bootstrapping, and it's a great candidate for this problem because (1) this problem is conceptually very simple, but (2) it's analytically very tedious to actually work with. Here's some pseudocode for what bootstrapping might look like if you want to investigate the biggest bucket:
num_of_trials = 555555555        # repeat this process many times (see * below)
max_bucket[1:num_of_trials] = 0  # vector of all 0's to later record max bucket sizes
for i = 1 to num_of_trials
{
    buckets[1:1000] = 0          # vector of all 0's, of length 1000
    for j = 1 to 100000          # j iterates over the different balls
    {
        draw = random(1, 1000)   # assign a random integer between 1 and 1000
        buckets[draw]++          # increment the buckets vector at position "draw"
    }
    max_bucket[i] = max(buckets) # write down highest bucket occupancy for this trial
}
At the end, you'll have a vector (max_bucket) containing the observed largest-bucket sizes, one per trial. You could do a lot with this, such as making a histogram to see how the maximum bucket size typically behaves. More importantly, you could find a threshold: establish what value constitutes the upper 5% of these values, and say that any maximum that comes out that high or higher is too extreme. Note that 5% is an arbitrary threshold, but it's also a common convention for these purposes.
Note the power of this approach; you could trivially modify it to investigate the smallest bucket, the gap between the largest and smallest bucket, or anything else that's easy to describe mathematically. This method gives you a quick way to focus on whatever thing you want to investigate and determine whether it's beyond the pale for typical results.
*Technical note: In reality, running 555,555,555 trials is probably too many for a typical computer; you'd want to start with something smaller and work up to whatever is a tolerable program runtime for you. More trials is better, but you'll run into long runtimes (and potentially memory errors) if you do too many. One quick way to check if you have enough trials is to run the above process twice and see if the results look terribly different from one another. If they do, you probably need more trials.
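If pseudocode isn't enough, here is roughly what the same loop looks like in Python with NumPy (a sketch, not the original poster's code; the trial count is reduced to keep the runtime short, and the variable names are mine):

```python
import numpy as np

rng = np.random.default_rng(0)

num_trials = 200       # far fewer than the pseudocode suggests, for a quick demo
num_balls = 100_000
num_buckets = 1_000

max_bucket = np.empty(num_trials, dtype=int)
for i in range(num_trials):
    draws = rng.integers(0, num_buckets, size=num_balls)  # one ball -> one bucket
    counts = np.bincount(draws, minlength=num_buckets)    # occupancy of each bucket
    max_bucket[i] = counts.max()                          # largest bucket this trial

threshold = np.quantile(max_bucket, 0.95)  # "too extreme" cutoff at the 5% level
```

Since each trial draws the bucket assignments in one vectorized call, a few hundred trials finish in seconds; scale `num_trials` up until two consecutive runs give similar thresholds.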
Since Cyan was already familiar with simulation approaches, I decided that my post was neither an answer nor particularly helpful. In hopes of completing it somewhat, here's some working R code:
num_of_trials <- 10000
max_values <- numeric(num_of_trials)  # preallocate storage for each trial's maximum
for (i in 1:num_of_trials) {
  # drop 100,000 balls into 1,000 buckets, then record the fullest bucket
  max_values[i] <- max(table(sample(1:1000, 100000, replace = TRUE)))
}
hist(max_values)               # distribution of the maximum bucket size
quantile(max_values, 0.95)     # 95th-percentile threshold
Output (histogram omitted):
The 95th percentile was 141, so a single bucket of size $\fbox{142 or higher}$ would raise suspicion if we set our significance level at the common value of $5\%$.
NB: I know that for() loops are frowned upon in R, but when I attempted to do this as a single vectorized operation, the resulting matrix was too large to store in memory. The runtime for this code on my average machine was about 6 minutes.
Best Answer
You could try a modified Poisson process. I will assume you are dividing up the continuous interval $[0,1]$ for simplicity; this can be discretized later.
There will be a series of borders, at positions $0=X_0<X_1<\dots<X_n<1$ to be randomly chosen. To the right of each border $X_i$ is a region whose kind is $n(i)$.
Associated to each kind $n$ is a parameter $\lambda_n$. These should be chosen so that

- ${\lambda_n^{-1}}/({\sum_{m=1}^N \lambda_m^{-1}})$ is equal to the desired "ratio" for the $n^\text{th}$ kind;
- $\frac1N\sum_{m=1}^N\lambda_m$ is equal to the desired expected number of borders. More borders $\implies$ less cohesion.
Here is the process: for each $i=0,1,2,\dots$, $n(i)$ is uniformly randomly chosen from the set of $N-1$ kinds, all except for $n(i-1)$, to prevent adjacent repeats (for $i=0$, choose uniformly from all $N$ kinds). Then, $X_{i+1}$ is set to $X_i + Y_i$, where $Y_i$ is exponentially distributed with parameter $\lambda_{n(i)}$. This continues until $X_{n+1}>1$ for some border $X_{n+1}$.
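The process described above is short enough to simulate directly. Here is a Python sketch (the function name `sample_regions` and the encoding of kinds as list indices are my own choices, not part of the answer):

```python
import random

def sample_regions(lambdas, seed=None):
    """Simulate the border process on [0, 1].

    lambdas[k] is the exponential rate for kind k, so a region of kind k
    has expected length 1/lambdas[k].  Returns a list of (border, kind)
    pairs; each region runs from its border to the next one (or to 1).
    """
    rng = random.Random(seed)
    N = len(lambdas)
    x, prev_kind = 0.0, None
    regions = []
    while x < 1.0:
        # choose a kind uniformly, excluding the previous region's kind
        choices = [k for k in range(N) if k != prev_kind]
        kind = rng.choice(choices)
        regions.append((x, kind))
        x += rng.expovariate(lambdas[kind])  # Y_i ~ Exp(lambda_{n(i)})
        prev_kind = kind
    return regions
```

Larger rates give shorter regions of that kind, matching the $\lambda_n^{-1}$ ratio condition; the loop stops once a border lands past 1, so the final region is simply truncated at 1.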