Desired specification for algorithm:
Given an input array of positive integers $x$, with length $n$, return an integer $i \in \{0, \dots, n-1 \}$ according to the probability distribution:
$$P(i)=\frac{x_i}{\operatorname{sum}(x)}$$
First, I should note that many libraries have built-in functions that do this, such as the numpy.random.choice
function in Python's NumPy library. With import numpy as np,
np.random.choice(len(x), p=np.asarray(x) / np.sum(x))
will meet the specification. The keyword argument p
is for passing in the desired probabilities.
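As a quick sanity check (my own sketch; the array x and the sample count are just illustrative), one can draw many samples and compare the empirical frequencies to the target distribution:

```python
import numpy as np

np.random.seed(0)                    # for reproducibility
x = np.array([2, 6, 9, 3, 1, 3])
samples = np.random.choice(len(x), size=100_000, p=x / x.sum())

# Compare empirical frequencies against the target x_i / sum(x).
empirical = np.bincount(samples, minlength=len(x)) / len(samples)
print(np.round(empirical, 3))
print(np.round(x / x.sum(), 3))
```

The empirical frequencies should agree with $x_i/\operatorname{sum}(x)$ to within sampling error.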
To code an algorithm from scratch, given only a random number generator that can produce a random integer from some range with uniform probability, one possible algorithm is:
Given an input array of positive integers $x$
Let $c$ be the array of cumulative sums.
Let $s=\operatorname{sum}(x)$.
Choose an integer $r$ with uniform probability from 0 to $s-1$, inclusive.
Let $i$ be the count of the entries in $c$ that are less than or equal to $r$.
Return $i$.
This would return an index $i$ from 0 to $\operatorname{length}(x)-1$ in such a way that the probability is given by: $$P(i)=\frac{x_i}{\operatorname{sum}(x)}$$
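A minimal Python sketch of these steps (weighted_index is my own name; random.randrange stands in for the assumed uniform integer generator, and any object with a randrange-style method can be passed as rng):

```python
import random
from itertools import accumulate

def weighted_index(x, rng=random):
    """Return an index i with probability x[i] / sum(x).

    Implements the steps above: build the cumulative sums c,
    draw r uniformly from {0, ..., s-1}, and count how many
    entries of c are less than or equal to r.
    """
    c = list(accumulate(x))          # cumulative sums of x
    s = c[-1]                        # s = sum(x)
    r = rng.randrange(s)             # uniform integer in 0..s-1
    return sum(1 for v in c if v <= r)

print(weighted_index([2, 6, 9, 3, 1, 3]))   # some index in 0..5
```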
For example, if the input array is $x=[2,6,9,3,1,3]$, then $c=[2,8,17,20,21,24]$ and $s=24$.
The algorithm would choose an integer $r$ with uniform probability from 0 to 23, inclusive.
If $r$ is 0 or 1, then $i=0$. (probability 2/24)
If $r$ is 2,3,4,5,6 or 7, then $i=1$. (probability 6/24)
If $r$ is 8,9,10,11,12,13,14,15 or 16, then $i=2$. (probability 9/24)
If $r$ is 17,18 or 19, then $i=3$. (probability 3/24)
If $r$ is 20, then $i=4$. (probability 1/24)
If $r$ is 21,22 or 23, then $i=5$. (probability 3/24)
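The case analysis above can be verified exhaustively with a short loop (pure Python, no extra libraries):

```python
from itertools import accumulate

x = [2, 6, 9, 3, 1, 3]
c = list(accumulate(x))        # [2, 8, 17, 20, 21, 24]
s = c[-1]                      # 24

# For each possible r, the chosen index is the number of entries
# of c that are <= r; tally how often each index is chosen.
counts = [0] * len(x)
for r in range(s):
    counts[sum(1 for v in c if v <= r)] += 1

print(counts)   # [2, 6, 9, 3, 1, 3]: index i is chosen for x[i] of the 24 values of r
```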
The intuition for the algorithm comes from the concept of inverse transform sampling in probability theory. Given a probability distribution in the form of a cumulative distribution function $F(x)$, you can generate random variables according to that distribution by generating a uniform random variable $u$ from 0 to 1, then choosing the smallest $x$ such that $F(x) \ge u$.
For this problem, in which the distribution is over a finite set of values, the probabilities are the values in the array $x$, divided by $\operatorname{sum}(x)$ so that the total probability is 1.
The cumulative distribution function is just the cumulative sum of those probabilities.
If we use our RNG to generate a random integer $r$ from 0 to $\operatorname{sum}(x)-1$, then $u=(r+1)/\operatorname{sum}(x)$ is uniform over $\{1/\operatorname{sum}(x), 2/\operatorname{sum}(x), \dots, 1\}$, so we don't need a separate division by $\operatorname{sum}(x)$. We just need to find the smallest index $i$ such that $c_i/\operatorname{sum}(x)\ge u$, i.e., the smallest $i$ with $c_i > r$.
We can find that index by counting the entries of $c$ that are less than or equal to $r$. Since the $x_j$ are positive, $c$ is strictly increasing, so if exactly $i$ entries of $c$ are at most $r$, then $c_i > r$ but $c_{i-1} \le r$ (or $i=0$).
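As an aside (my own addition), this count is exactly the insertion point a binary search over $c$ returns, so Python's bisect module gives the same index in $O(\log n)$ time instead of a linear scan:

```python
from bisect import bisect_right
from itertools import accumulate

x = [2, 6, 9, 3, 1, 3]
c = list(accumulate(x))        # [2, 8, 17, 20, 21, 24]

# bisect_right(c, r) is the number of entries of c that are <= r,
# i.e. the smallest index i with c[i] > r, found by binary search.
for r in range(c[-1]):
    assert bisect_right(c, r) == sum(1 for v in c if v <= r)
print("counting and binary search agree for every r")
```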
Best Answer
Given $n$ as a parameter, define a new parameter $r\equiv \lceil \log_2 n \rceil$. For example, when $n = 8$ we have $r = 3$, and when $n = 10$ we have $r = 4$, so at least $4$ rounds will have occurred at termination.
Denote the probability "the input size gets halved" as $p$. Here $p = 1/2$. Denote $X$ as the "total number of rounds when the algorithm terminates", then $X$ follows a Negative-Binomial distribution, with the following probability mass function:
$$\Pr\left\{ X = k\right\} = { k-1\choose r-1}p^r(1-p)^{k-r}~, \qquad k = r, r+1, r+2, \ldots$$ This is a standard discrete distribution, and its expectation is well known:
$$\mathbb{E}[X] = \frac{r}p = \frac{\lceil \log_2 n \rceil}p = 2\lceil \log_2 n \rceil$$ The above is the requested answer.
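This expectation can be checked numerically by summing the PMF directly (a sketch; the truncation point $k=500$ is arbitrary, but the remaining tail is negligible for $p = 1/2$):

```python
from math import ceil, comb, log2

n, p = 10, 0.5
r = ceil(log2(n))              # r = 4 for n = 10

# E[X] = sum_{k >= r} k * C(k-1, r-1) * p^r * (1-p)^(k-r),
# truncated at k = 500 where the remaining tail is negligible.
E = sum(k * comb(k - 1, r - 1) * p**r * (1 - p)**(k - r)
        for k in range(r, 500))
print(E)                       # ≈ r / p = 8.0
```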
Below is a clarification in case of confusion.
There are at least two common parametrizations of the Negative-Binomial distribution, besides simply switching $p \leftrightarrow q$ (and relabeling which outcome is the "success"). Please be careful with the definition when reading materials from various sources.
Whenever you see the expectation expressed as $\mathbb{E}[Y] = r(1-p)/p$, the distribution is parametrized as $$Y \equiv \text{number of (rounds of) failures until termination at the $r$th success}$$ whereas here $X$ is the total number of rounds. Namely, $X = r+Y$, which means $$\mathbb{E}[X] = \mathbb{E}[r+Y] = r+\mathbb{E}[Y] = r+\frac{r(1-p)}p =\frac{r}p$$ so the two forms are necessarily consistent.
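The consistency of the two parametrizations can likewise be checked numerically (my own sketch, using $r = 4$ and $p = 1/2$ as above, with the sum truncated at $m = 500$):

```python
from math import comb

r, p = 4, 0.5

# P(Y = m) = C(m + r - 1, r - 1) * p^r * (1-p)^m for m = 0, 1, 2, ...
# (Y = number of failures before the r-th success).
EY = sum(m * comb(m + r - 1, r - 1) * p**r * (1 - p)**m
         for m in range(500))
print(EY)          # ≈ r * (1 - p) / p = 4.0
print(r + EY)      # ≈ r / p = 8.0, matching E[X]
```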