Solved – How to efficiently choose a subset of $n$ numbers out of a set of $m$ numbers, uniformly at random

algorithms, combinatorics, randomness, sampling, uniform distribution

Problem:

It is fairly simple: we have a list of numbers $x_1, x_2, \ldots,x_n,\ldots, x_m$. Our goal is to choose a subset of $n$ of these numbers uniformly at random.

This means that, for any $i \in \{1,2,\ldots,n,\ldots,m\}$, the probability that $x_i$ is chosen must be:
$$
n \cdot \frac{1}{m} = \frac{n}{m}
$$
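As a quick sanity check on this target probability (a sketch in Python, not part of the original question): of the $\binom{m}{n}$ equally likely $n$-subsets, exactly $\binom{m-1}{n-1}$ contain a fixed $x_i$, and the ratio is $n/m$.

```python
from fractions import Fraction
from math import comb

# Of the C(m, n) equally likely n-subsets, the ones containing a fixed
# x_i are the C(m-1, n-1) ways of picking the remaining n-1 elements.
m, n = 10, 4
p = Fraction(comb(m - 1, n - 1), comb(m, n))
print(p)  # 2/5, i.e. n/m = 4/10
```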

An algorithm:

An instruction $\text{swap}(x_{\text{left}},x_{\text{right}})$ simply swaps the two values, i.e.

  1. $t := x_{\text{left}}$
  2. $x_{\text{left}} := x_{\text{right}}$
  3. $x_{\text{right}} := t$

A suggested algorithm is: for each $i \in \{1,2,\ldots,n\}$, randomly and uniformly choose some $k_i \in \{1,2,\ldots,n, \ldots, m\}$, and then execute $\text{swap}(x_i, x_{k_i})$. Then return $x_1, x_2, \ldots, x_n$ as the chosen subset of $n$ numbers.
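For concreteness, here is a minimal Python sketch of that suggested algorithm (the function name `suggested_sample` is mine, not from the question):

```python
import random

def suggested_sample(x, n):
    """The suggested algorithm: for each of the first n positions,
    swap with a position drawn uniformly from the WHOLE list."""
    x = list(x)
    m = len(x)
    for i in range(n):
        k = random.randrange(m)  # k_i uniform over all m positions
        x[i], x[k] = x[k], x[i]
    return x[:n]
```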

The challenge:

Intuitively, that algorithm looks perfectly fine to me. I see absolutely no problem with it.

But when I try to look at it mathematically, I fail to prove it. Instead, I actually seem to prove that it is not a correct solution!

First, let's look at the probability of choosing the first $n$ numbers:

For any $i \in \{1,2,\ldots,n\}$, $x_i$ can be chosen if:

  • $x_i$ appears as the right-hand argument of some $\text{swap}$. There are exactly $n$ such cases, including the case where $x_i$ appears as both the left- and the right-hand argument.
  • $x_i$ appears as the left-hand argument of a $\text{swap}$ whose right-hand argument is some $x_j$ with $j \in \{1,2,\ldots,n\}$. There are exactly $n$ such cases, but one of them ($i=j$) was already counted above; to avoid double counting, we take $n-1$ cases.

There is no other case in which $x_i$ is chosen. Therefore the probability of choosing a number $x_i$ with $i \in \{1,2,\ldots,n\}$ is:
$$
\frac{n + (n-1)}{m^2}
$$

Now, let's look at the probability of choosing the rest of the numbers, up to $x_m$:

For any $i \in \{n+1,n+2,\ldots,m\}$, $x_i$ can be chosen if:

  • $x_i$ appears as the right-hand argument of a $\text{swap}$ whose left-hand argument is some $x_j$ with $j \in \{1,2,\ldots,n\}$. There are exactly $n$ such cases.

There is no other case in which $x_i$ is chosen. So the probability of choosing $x_i$ is:
$$
\frac{n}{m^2}
$$

Comparing the probabilities:

That says that there is a bias: the probability of choosing the first $n$ numbers is higher than the probability of choosing the rest of the numbers up to the $m^{\text{th}}$.

My question:

Where did I go wrong? How to address this problem?

Best Answer

The "suggested algorithm" is incorrect. One way to see why that is so is to count the number of equiprobable permutations performed by the algorithm. At each step there are $m$ possible values for $k$, whence after $n$ steps there are $m^n$ possible results. Although many results will be duplicated, the point is that the probability of outputting any particular permutation must be a multiple of $m^{-n}$. However, the correct probability, $1/m!$, is rarely such a multiple. For instance, with $m=3$ and $n=2$ you will produce $3^2=9$ possible permutations, each with chance $1/9$, but there are only $m!=6$ distinct permutations. Since $1/6$ is not a multiple of $1/9$, none of the permutations will be produced with a chance of $1/6$.
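This counting argument can be checked by brute force. A Python sketch that enumerates all $3^2 = 9$ equally likely executions of the suggested algorithm for $m=3$, $n=2$ and tallies the resulting permutations:

```python
from itertools import product

def run(choices, m=3):
    """Execute the suggested algorithm on (1, ..., m) with an explicit
    sequence of swap targets (one per step, each in 0..m-1, 0-based)."""
    x = list(range(1, m + 1))
    for i, k in enumerate(choices):
        x[i], x[k] = x[k], x[i]
    return tuple(x)

counts = {}
for choices in product(range(3), repeat=2):  # all 9 executions
    perm = run(choices)
    counts[perm] = counts.get(perm, 0) + 1

print(sorted(counts.values()))  # [1, 1, 1, 2, 2, 2]: three permutations
                                # have probability 2/9, three have 1/9
```

As the answer predicts, no permutation comes out with probability $1/6$.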

There are better ways. One is "Algorithm P" from Knuth's Seminumerical Algorithms, section 3.4.2:

for i := 1 to n do
    Swap(x[i], x[RandInt(i,m)])

The proof that this works is by induction on $m$.

  1. Obviously it works in the base case (either $m=0$ or $m=1$, as you prefer).

  2. In the first step, every $x_k$ has an equal chance of being moved into the first position. The algorithm proceeds to work recursively and independently on the elements in positions $2$ through $n$, where by the inductive hypothesis every one of those elements has an equal and independent chance of appearing anywhere among the first $n-1$ positions. Consequently all $m$ of the elements of $x$ have equal and independent chances of appearing among the first $n$ positions when the algorithm terminates, QED.
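A Python sketch of Algorithm P, taking the random draws as an explicit argument so the uniformity claim can be verified exhaustively (0-based indices; `RandInt(i, m)` becomes a uniform draw from `{i, ..., m-1}`):

```python
from itertools import product

def algorithm_p(x, n, choices):
    """Partial Fisher-Yates shuffle: choices[i] plays the role of
    RandInt(i, m), i.e. a uniform draw from positions i..m-1 (0-based)."""
    x = list(x)
    for i, k in enumerate(choices):
        x[i], x[k] = x[k], x[i]
    return x[:n]

# Exhaustively enumerate the 4 * 3 = 12 equally likely executions for
# m = 4, n = 2 and count how often each element lands in the output.
m, n = 4, 2
counts = [0] * m
for choices in product(*(range(i, m) for i in range(n))):
    for v in algorithm_p(range(m), n, choices):
        counts[v] += 1

print(counts)  # [6, 6, 6, 6]: every element appears in 12 * (2/4) = 6 runs
```

Unlike the suggested algorithm, here every element is included with probability exactly $n/m$.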

Knuth's Algorithm S guarantees the output elements will appear in the order in which they originally appeared in $x$:

Select := n
Remaining := m
for i := 1 to m do
    if RandReal(0,1) * Remaining < Select then
        output x[i]
        Select--
    Remaining--

Exercises (2), (3), and (4) in that section ask the reader to prove this algorithm works. Once again the proof is an induction on $m$.

  1. When $m=n=1$, $x$ itself is always returned.

  2. Otherwise, $x_1$ will be output with probability $n/m$ in the first step, which is the correct probability, and the algorithm proceeds recursively to output $n-1$ elements of $x_{-1} = (x_2, x_3, \ldots, x_m)$ in sorted order if $x_1$ was output and otherwise will output $n$ elements of $x_{-1}$ in sorted order, QED.
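The inductive claim can also be checked numerically. Here is a Python sketch (my own, not from Knuth) that computes each element's exact inclusion probability under Algorithm S by recursing over both branches of the probabilistic test, using rational arithmetic so there is no simulation noise:

```python
from fractions import Fraction

def inclusion_probs(m, n):
    """Exact P(x_i is output) under Algorithm S, for each i, obtained by
    weighting both branches of 'RandReal(0,1) * Remaining < Select'."""
    probs = [Fraction(0)] * m

    def walk(i, select, remaining, weight):
        if select == 0 or i == m:
            return
        p_take = Fraction(select, remaining)  # P(test succeeds at step i)
        probs[i] += weight * p_take
        walk(i + 1, select - 1, remaining - 1, weight * p_take)
        walk(i + 1, select, remaining - 1, weight * (1 - p_take))

    walk(0, n, m, Fraction(1))
    return probs

print(inclusion_probs(5, 2))  # five copies of Fraction(2, 5), i.e. n/m
```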


If you would like to follow along, here is an executable version in R, along with a quick simulation to verify that all elements of $x$ have equal chances of being included.

algorithm.S <- function(x, n=1) {
  m <- length(x)
  #
  # Check input.
  #
  if (n < 0 || n > m) stop("Subset size out of range.")
  #
  # Handle special cases that R has trouble with.
  #
  if (m <= 1) 
    if (n==0) return (x[c()]) else return(x)
  #
  # The algorithm.
  #
  y <- x[1:n]
  select <- n
  remaining <- m
  j <- 0
  for (i in 1:m) {
    if (runif(1) * remaining < select) {
      j <- j+1; y[j] <- x[i]  # store the element itself, not its index
      select <- select-1
    }
    remaining <- remaining-1
  }
  return(y)
}

x <- 1:10
sim <- replicate(1e4, algorithm.S(x, 4))
hist(sim, breaks=seq(min(x)-1/2, max(x)+1/2, by=1))

[Figure: histogram of the simulated selections, with all ten values occurring at roughly equal frequency.]
