Estimating the sum of a subset of positive integers

integers, probability, summation

Suppose we have a sequence of $n$ positive integers, each sampled i.i.d. from a known pmf $p(c)$:
$$\{c_1, c_2, c_3, \dots, c_n\}$$
$$c_i \in \mathbb{N}$$

The sum of these integers is a known constant $k$:
$$\sum_{i=1}^{n} c_i = k$$

Hence, the mean of the sequence is:

$$\bar{c} = \frac{\sum_{i=1}^{n} c_i}{n} = \frac{k}{n}$$

Now, the first $m$ numbers are summed to give $x$:
$$\sum_{i=1}^{m} c_i = x$$

and this sum should be approximately equal to $m$ times the mean:

$$\frac{\sum_{i=1}^{m} c_i}{m} = \frac{x}{m} \approx \bar{c} \implies x \approx m\bar c$$

How can we estimate $x$ by a constant times this product, such that the estimate is as close to the actual value as possible while falling below it with probability at least $p$?
$$Pr(\alpha m \bar c<x)\geq p$$
$$\alpha \in [0, 1]$$

My first thought is that if $\alpha = 1$, this will be an overestimate with probability $p = 0.5$.

Only an approximate answer is required; no closed-form answer is needed.

Any help is greatly appreciated!

Best Answer

Since you're happy with a normal approximation, you can replace the $c_i$ by i.i.d. continuous Gaussian random variables with mean $\mu$ and variance $\sigma^2$ determined from the PMF of the $c_i$. Then you're looking for the conditional probability distribution of the sum of the first $m$ of these variables, given that all $n$ of them sum to $k$. Let $S_j$ denote the sum of the first $j$ variables. Then

\begin{eqnarray*} f_{S_m}(S_m\mid S_n=k) &=&\frac{f_{S_m,S_n}(S_m,k)}{f_{S_n}(k)} \\ &=& \frac{f_{S_m,S_{n-m}}(S_m,k-S_m)}{f_{S_n}(k)} \\ &=& \frac{f_{S_m}(S_m)f_{S_{n-m}}(k-S_m)}{f_{S_n}(k)} \\ &=& \frac1{\sqrt{2\pi\sigma^2}}\sqrt{\frac n{m(n-m)}}\exp\left(-\frac{(S_m-m\mu)^2}{2m\sigma^2}-\frac{(k-S_m-(n-m)\mu)^2}{2(n-m)\sigma^2}+\frac{(k-n\mu)^2}{2n\sigma^2}\right) \\ &=& \frac1{\sqrt{2\pi\sigma'^2}}\exp\left(-\frac{(S_m-\mu')^2}{2\sigma'^2}\right), \end{eqnarray*}

again a normal distribution, with mean

$$ \mu'=\frac mnk $$

(as you expected) and variance

$$ \sigma'^2=\frac1{\frac1m+\frac1{n-m}}\sigma^2=\frac{m(n-m)}n\sigma^2\;, $$

which is zero for $m=0$ or $m=n$ (since in these cases we know the sum to be $0$ and $k$, respectively) and takes its maximal value $\frac n4\sigma^2$ for $m=\frac n2$, i.e. half of the variance $\frac n2\sigma^2$ of the unconditional distribution of $S_{\frac n2}$.
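As a sanity check on these formulas (not part of the derivation), one can simulate integer sequences from a concrete pmf — here, as an assumed example, the uniform pmf on $\{1,\dots,5\}$, which has $\mu=3$ and $\sigma^2=2$ — condition on the exact total $S_n=k$, and compare the empirical mean and variance of $S_m$ with $\mu'$ and $\sigma'^2$:

```python
import random

# Assumed example pmf: uniform on {1,...,5}, so mu = 3 and sigma^2 = 2.
n, m, k = 20, 10, 60           # k = n*mu, the most likely total
var = 2.0

random.seed(0)
samples = []                   # values of S_m among sequences with S_n = k
for _ in range(200_000):
    c = [random.randint(1, 5) for _ in range(n)]
    if sum(c) == k:            # condition on the exact total
        samples.append(sum(c[:m]))

mean = sum(samples) / len(samples)
variance = sum((s - mean) ** 2 for s in samples) / len(samples)

print(mean, variance)          # empirical conditional mean and variance
print(m * k / n)               # prediction mu'      = 30
print(m * (n - m) / n * var)   # prediction sigma'^2 = 10
```

Since the $c_i$ here are discrete rather than Gaussian, the empirical values only approximately match the predictions, but for these parameters they land close to $\mu'=30$ and $\sigma'^2=10$.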

Your constant $\alpha$ is then determined from

$$ P(S_m\gt\alpha\mu')=\frac12\left(1-\operatorname{erf}\left(\frac{\alpha\mu'-\mu'}{\sigma'\sqrt2}\right)\right)=\frac12\left(1-\operatorname{erf}\left((\alpha-1)\frac k\sigma\sqrt{\frac m{2n(n-m)}}\right)\right)\;; $$

setting this equal to $p$ and solving for $\alpha$ yields $\alpha$ in terms of the inverse error function $\operatorname{erf}^{-1}$:

$$ \alpha=1+\frac\sigma k\sqrt{\frac{2n(n-m)}m}\operatorname{erf}^{-1}(1-2p)\;. $$

As you expected, $\alpha=1$ for $p=\frac12$; for $p\gt\frac12$ the inverse error function is negative, so $\alpha\lt1$, giving the desired underestimate.
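A short numerical sketch of this final formula, with assumed example values for $n$, $m$, $k$ and $\sigma$ (here the pmf's standard deviation). Since $\operatorname{erf}^{-1}$ isn't in Python's standard library, it is inverted by bisection on `math.erf`:

```python
import math

def erfinv(y, lo=-6.0, hi=6.0):
    """Invert math.erf by bisection (math.erf is monotone increasing)."""
    for _ in range(80):
        mid = (lo + hi) / 2
        if math.erf(mid) < y:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def alpha(p, n, m, k, sigma):
    """alpha = 1 + (sigma/k) * sqrt(2n(n-m)/m) * erfinv(1 - 2p)."""
    return 1 + sigma / k * math.sqrt(2 * n * (n - m) / m) * erfinv(1 - 2 * p)

# Assumed example values: n = 100, m = 50, k = 300, sigma = sqrt(2)
print(alpha(0.5, 100, 50, 300, math.sqrt(2)))   # approximately 1, as expected
print(alpha(0.9, 100, 50, 300, math.sqrt(2)))   # below 1: a slight underestimate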
