To distribute the number $n$ into $p$ parts as evenly as possible, compute the truncating integer division of $n$ by $p$, together with the corresponding remainder.
Mathematically that is (assuming that both $n$ and $p$ are strictly
positive integers)
$$
d = \left\lfloor \frac np \right\rfloor \, , \quad r = n \bmod p = n-pd \, .
$$
In many programming languages you would do something like
int d = n / p; // truncating integer division
int r = n % p; // remainder
Then $$n = pd + r = r(d+1) + (p-r)d $$
so that the desired partition is
$$
\underbrace{d+1, \ldots, d+1}_{r \text{ times}}, \underbrace{d, \ldots, d}_{p-r \text{ times}}
$$
Like Arthur said, the thing you're looking for is a $p$-value; the apparatus that will give it is a hypothesis test.
There are lots of ways that you can do this. The chi-squared test, suggested in the comments, is a good place to start; see also the multinomial test. These are good tools to measure the overall "randomness" of a sample and to tell whether an observed sample was or was not likely to have come from a purported distribution. However, I'll admit that there's a fairly high barrier to entry on these articles, and they're not necessarily great introductions to the subject of hypothesis testing.
There's another workaround to get what you're asking for, and it may very well be simpler if you're not already well-versed in the language of hypothesis tests. If you want to focus specifically on the maximum number of objects in the same bucket, for instance, all you need to know is how that quantity typically looks under the scenario you outlined above (distributing the items randomly to each bucket). That can be monstrously complicated to deal with theoretically, but it's not too hard to write some code to simulate the phenomenon. The basic idea is to code up the quantity you're interested in, repeat the experiment many, many times, and observe what happens across the many different simulations.
The process I'll outline is an instance of bootstrapping, and it's a great candidate for this problem because (1) this problem is conceptually very simple, but (2) it's analytically very tedious to actually work with. Here's some pseudocode for what bootstrapping might look like if you want to investigate the biggest bucket:
num_of_trials = 555555555        # repeat this process many times (see * below)
max_bucket[1:num_of_trials] = 0  # vector of all 0's to later record max bucket sizes
for i = 1 to num_of_trials
{
    buckets[1:1000] = 0          # vector of all 0's, of length 1000
    for j = 1 to 100000          # j iterates over the different balls
    {
        draw = random(1, 1000)   # assign a random integer between 1 and 1000
        buckets[draw]++          # increment the buckets vector at position "draw"
    }
    max_bucket[i] = max(buckets) # write down highest bucket occupancy for this trial
}
At the end, you'll have a vector (max_bucket) containing the observed largest-bucket sizes, one per trial. You could do a lot with this, such as making a histogram to see how the maximum bucket size typically behaves. More importantly, you could find a threshold: establish what value constitutes the upper 5% of these values, and say that any maximum that comes out that high or higher is too extreme. Note that 5% is an arbitrary threshold, but it's also a common convention for these purposes.
Note the power of this approach; you could trivially modify it to investigate the smallest bucket, the gap between the largest and smallest bucket, or anything else that's easy to describe mathematically. This method gives you a quick way to focus on whatever thing you want to investigate and determine whether it's beyond the pale for typical results.
*Technical note: In reality, running 555,555,555 trials is probably too many for a typical computer; you'd want to start with something smaller and work up to whatever is a tolerable program runtime for you. More trials is better, but you'll run into long runtimes (and potentially memory errors) if you do too many. One quick way to check if you have enough trials is to run the above process twice and see if the results look terribly different from one another. If they do, you probably need more trials.
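If pseudocode isn't enough, here is roughly what the same loop looks like in Python with NumPy (a sketch, not the original poster's code; the trial count is reduced to keep the runtime short, and the variable names are mine):

```python
import numpy as np

rng = np.random.default_rng(0)

num_trials = 200       # far fewer than the pseudocode suggests, for a quick demo
num_balls = 100_000
num_buckets = 1_000

max_bucket = np.empty(num_trials, dtype=int)
for i in range(num_trials):
    draws = rng.integers(0, num_buckets, size=num_balls)  # one ball -> one bucket
    counts = np.bincount(draws, minlength=num_buckets)    # occupancy of each bucket
    max_bucket[i] = counts.max()                          # largest bucket this trial

threshold = np.quantile(max_bucket, 0.95)  # "too extreme" cutoff at the 5% level
```

Since each trial draws the bucket assignments in one vectorized call, a few hundred trials finish in seconds; scale `num_trials` up until two consecutive runs give similar thresholds.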
Since Cyan was already familiar with simulation approaches, I decided that my post was neither an answer nor particularly helpful. In hopes of completing it somewhat, here's some working R code:
num_of_trials <- 10000
max_values <- numeric(num_of_trials)  # preallocate storage for each trial's maximum
for (i in 1:num_of_trials) {
  # drop 100,000 balls into 1,000 buckets, then record the fullest bucket
  max_values[i] <- max(table(sample(1:1000, 100000, replace = TRUE)))
}
hist(max_values)               # distribution of the maximum bucket size
quantile(max_values, 0.95)     # 95th-percentile threshold
Output (histogram omitted):
The 95th percentile was 141, so a single bucket of size $\fbox{142 or higher}$ would raise suspicion if we set our significance level at the common value of $5\%$.
NB: I know that for() loops are frowned upon in R, but when I attempted to do this as a single vectorized operation, the resulting matrix was too large to store in memory. The runtime for this code on my average machine was about 6 minutes.
Best Answer
You could try a modified Poisson process. I will assume you are dividing up the continuous interval $[0,1]$ for simplicity; this can be discretized later.
There will be a series of borders, at positions $0=X_0<X_1<\dots<X_n<1$ to be randomly chosen. To the right of each border $X_i$ is a region whose kind is $n(i)$.
Associated to each kind $n$ is a parameter $\lambda_n$. These should be chosen so that

- ${\lambda_n^{-1}}/({\sum_{m=1}^N \lambda_m^{-1}})$ is equal to the desired "ratio" for the $n^\text{th}$ kind;
- $\frac1N\sum_{m=1}^N\lambda_m$ is equal to the desired expected number of borders. More borders $\implies$ less cohesion.
Here is the process: for each $i=0,1,2,\dots$, $n(i)$ is uniformly randomly chosen from the set of $N-1$ kinds, all except for $n(i-1)$, to prevent adjacent repeats (for $i=0$, choose uniformly from all $N$ kinds). Then, $X_{i+1}$ is set to $X_i + Y_i$, where $Y_i$ is exponentially distributed with parameter $\lambda_{n(i)}$. This continues until $X_{n+1}>1$ for some border $X_{n+1}$.
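The process described above is short enough to simulate directly. Here is a Python sketch (the function name `sample_regions` and the encoding of kinds as list indices are my own choices, not part of the answer):

```python
import random

def sample_regions(lambdas, seed=None):
    """Simulate the border process on [0, 1].

    lambdas[k] is the exponential rate for kind k, so a region of kind k
    has expected length 1/lambdas[k].  Returns a list of (border, kind)
    pairs; each region runs from its border to the next one (or to 1).
    """
    rng = random.Random(seed)
    N = len(lambdas)
    x, prev_kind = 0.0, None
    regions = []
    while x < 1.0:
        # choose a kind uniformly, excluding the previous region's kind
        choices = [k for k in range(N) if k != prev_kind]
        kind = rng.choice(choices)
        regions.append((x, kind))
        x += rng.expovariate(lambdas[kind])  # Y_i ~ Exp(lambda_{n(i)})
        prev_kind = kind
    return regions
```

Larger rates give shorter regions of that kind, matching the $\lambda_n^{-1}$ ratio condition; the loop stops once a border lands past 1, so the final region is simply truncated at 1.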