Consider some large number of balls, $N$, and index the balls by $j=1,2, \dots, N$. Also consider some fixed $k \in \{0, 1, 2, 3\}$.
Define the indicator random variable $I_{k,j}$ to be $1$ if ball $j$ has $k$ green neighbors, and $0$ otherwise. Note that $E[I_{k,j}] = {3 \choose k} p^k (1-p)^{3-k}.$ We now have:
$$X_k = \frac{S}{N} \qquad\text{where}\qquad S=\sum_{j=1}^N I_{k,j}.$$
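For instance, taking $p = 1/2$ and $k = 1$ (values chosen purely for concreteness):
$$E[I_{1,j}] = {3 \choose 1}\left(\frac{1}{2}\right)\left(\frac{1}{2}\right)^{2} = \frac{3}{8},$$
so for large $N$ we expect $X_1 \approx 3/8$.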
You are correct that since the $I_{k,j}$ are dependent, the sum $S$ is not binomial. However, the dependence is so limited that the laws of large numbers still apply: in this example, you can partition the red balls into a small, fixed number of subsets such that no two red balls in the same subset are dependent.
One way to do the partitioning: first, notice that each red ball is dependent only on its $6$ nearest red neighbors. So color every "odd-numbered" row of red balls alternately with $2$ different shades of red, and every "even-numbered" row alternately with another $2$ shades. With a total of $4$ shades you have colored the reds so that any two red balls sharing a blue/green neighbor have different shades. You have also broken up:
$$S=S_1 + S_2 + S_3 + S_4$$
where each $S_i$ is the sum of the $I_{k,j}$ in subset $i$ (that particular shade of red). Now each $S_i$ is a sum of i.i.d. variables, so $S_i/(N/4) \to E[I_{k,j}]$. Summing back up, you also get $X_k \to E[I_{k,j}]$.
BTW each $S_i$ is now binomial, but that fact is not really necessary; all we need is that $S_i$ is a sum of i.i.d. variables, so the laws of large numbers apply.
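To see the partitioning idea numerically, here is a minimal R sketch in a simplified one-dimensional analog (my own toy setup with two neighbors per ball, not the layout in your question): each ball is independently green with probability $p$, and $I_j$ indicates that both neighbors of ball $j$ are green. Indicators two apart share a ball and are dependent, but splitting the indices into $4$ classes mod $4$ (the analog of the $4$ shades) makes each class i.i.d.

set.seed(1)
N = 10^5; p = 0.4
green = rbinom(N, 1, p)                        # each ball independently green w.p. p
j = 2:(N - 1)
I = green[j - 1] * green[j + 1]                # I_j = 1 iff both neighbors of ball j are green
mean(I)                                        # overall average: close to p^2 = 0.16
sapply(0:3, function(r) mean(I[j %% 4 == r]))  # the 4 "shades": each class average is also near 0.16

Both the class averages and the overall average settle near $E[I_j] = p^2$, even though the $I_j$ are not independent.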
Basically, the laws of large numbers apply when the variables being summed are only weakly dependent. The above is one example, but you can find more in the following, which I found after a few minutes of googling; you may like those proofs or insights better.
Weak Law of Large Numbers for Dependent Random Variables with Bounded Covariance
https://mathoverflow.net/questions/2409/strong-law-of-large-numbers-for-weakly-dependent-random-variables
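In the setting above the covariances vanish beyond a fixed lag, so Chebyshev's inequality already gives the weak law directly. Writing $Y_i$ for the summands, $m$ for the maximum dependence lag, and $C$ for a common bound on the covariances ($m$ and $C$ are my notation, not from the links):
$$\operatorname{Var}\left(\frac{1}{n}\sum_{i=1}^n Y_i\right) = \frac{1}{n^2}\sum_{i,j}\operatorname{Cov}(Y_i, Y_j) \le \frac{(2m+1)\,C\,n}{n^2} \to 0,$$
and a variance of the sample mean tending to $0$ is exactly what the Chebyshev argument for the weak law needs.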
I think @DavidDStork's comment is an idea worth considering; I would use the standard deviation instead of the variance. [See the note at the end for another possibility.]
If you have only odd numbers, this 'index of evenness' will be zero; for less 'even' sequences the measure will be greater. [Computations in R.]
x = seq(1, 100, by = 2); x
[1] 1 3 5 7 9 11 13 15 17 19 21 23 25
[14] 27 29 31 33 35 37 39 41 43 45 47 49 51
[27] 53 55 57 59 61 63 65 67 69 71 73 75 77
[40] 79 81 83 85 87 89 91 93 95 97 99
diff(x)
[1] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
[21] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
[41] 2 2 2 2 2 2 2 2 2
sd(diff(x))
[1] 0
If the 50 numbers are chosen at random without replacement,
you might wonder about the distribution of this measure.
How small does the measure have to be in order for the
sequence to be called 'relatively even'?
The following simulation shows the distribution for
100,000 such random choices:
set.seed(2020)
sd.dif = replicate(10^5, sd(diff(sort(sample(1:100, 50)))))
summary(sd.dif)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.8165 1.2410 1.3534 1.3680 1.4777 2.9226
hist(sd.dif, prob=T, col="skyblue2")
[Histogram of the simulated index-of-evenness values sd.dif.]
So if you get an 'index of evenness' below $1$ you
can be pretty sure the sequence of 50 wasn't randomly
chosen.
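To put a number on 'pretty sure', you can check the lower tail of the simulated null distribution (reusing sd.dif from the code above):

mean(sd.dif < 1)                 # proportion of random sequences with index below 1
quantile(sd.dif, c(.01, .05))    # 1% and 5% points of the null distribution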
Here are three randomly generated sequences and their indices of evenness:
a = sort(sample(1:100, 50)); a
[1] 1 2 4 5 8 9 13 17 19 22
[11] 26 27 28 29 30 32 33 34 35 40
[21] 44 49 50 52 53 57 60 61 63 64
[31] 68 70 71 72 74 78 80 82 84 85
[41] 86 87 89 93 94 96 97 98 99 100
sd(diff(a))
[1] 1.266389
b = sort(sample(1:100, 50)); b
[1] 1 2 3 4 5 10 13 14 15 16
[11] 17 21 25 27 28 33 34 35 36 38
[21] 41 42 43 44 48 55 57 59 60 62
[31] 63 64 67 68 69 70 71 72 75 76
[41] 79 84 86 92 93 94 95 96 99 100
sd(diff(b))
[1] 1.547711
c = sort(sample(1:100, 50)); c
[1] 1 4 5 6 9 10 15 17 19 21
[11] 22 24 25 27 29 32 33 36 38 39
[21] 40 42 43 44 47 48 50 52 56 57
[31] 60 63 64 65 68 69 73 75 79 80
[41] 81 82 84 85 86 96 97 98 99 100
sd(diff(c))
[1] 1.561113
By Stork's criterion, sequence a is 'smoother' than c. If we plot the sequences below, the points follow the diagonal reference line more smoothly for a. [Note the big jump in the plot of c.]
par(mfrow=c(1,2))
plot(a, pch=20); abline(0, 2, col="green2")
plot(c, pch=20); abline(0, 2, col="green2")
par(mfrow=c(1,1))  # reset to a single plotting panel
Note: The plots may suggest another useful way to define 'smoothness': how nearly do they match a straight line? One way to answer that is by looking at the correlation of the sequence with the vector of numbers from 1 through 50. A correlation nearer to $1$ indicates a better fit to a straight line. By this criterion, c is smoother than a. [Note the wobble in the plot of a.]
So you need to think about which criterion best matches
your personal view of 'smoothness'.
cor(1:50, a)
[1] 0.9974012
cor(1:50, c)
[1] 0.9980791
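If you want to calibrate the correlation criterion the same way, you can simulate its null distribution too, mirroring the earlier simulation (a sketch; the name cor.null is mine):

set.seed(2020)
cor.null = replicate(10^5, cor(1:50, sort(sample(1:100, 50))))
summary(cor.null)                # shows what correlations are typical under random choice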
Best Answer
Let $X_{(1)}, \ldots, X_{(N)}$ be the chosen integers in increasing order (the order statistics). For simplicity I'll suppose $A = 1$. Of course we must have $B \ge N$. Then I claim that all the "gaps" $X_{(j+1)} - X_{(j)}$ as well as $B+1 - X_{(N)}$ and $X_{(1)} - 0$ have expected value $(B+1)/(N+1)$.
Note that $E[X_{(1)} | X_{(2)}] = X_{(2)}/2$, because given $X_{(2)} = x$, $X_{(1)}$ is equally likely to be any of the integers 1 to $x-1$. Thus $E[X_{(1)}] = E[X_{(2)} - X_{(1)}]$. Similarly, given $X_{(j)} = x$ and $X_{(j+2)} = y$, $X_{(j+1)}$ is equally likely to be any of the integers $x+1$ to $y-1$, so $E[X_{(j+2)} - X_{(j+1)}] = E[X_{(j+1)} - X_{(j)}]$. Similarly, $E[B+1-X_{(N)}] = E[X_{(N)} - X_{(N-1)}]$. Thus all $N+1$ gaps have the same expected value, and since they add up to $B+1$ that expected value is $(B+1)/(N+1)$.
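A quick simulation agrees with this. For example, with $A = 1$, $B = 100$, $N = 50$, each of the $N + 1 = 51$ gaps should average $(B+1)/(N+1) = 101/51 \approx 1.98$ (a sketch in R, matching the code style above):

set.seed(1)
B = 100; N = 50
gaps = replicate(10^4, {
  x = sort(sample(1:B, N))
  c(x[1], diff(x), B + 1 - x[N])   # the N+1 gaps; they always sum to B+1
})
range(rowMeans(gaps))              # each gap's long-run average is near 101/51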