Estimating the sum of a subset of positive integers

integers, probability, summation

Suppose we have a sequence of $n$ positive integers, each sampled i.i.d. from a known pmf $p(c)$:
$$\{c_1, c_2, c_3, \dots, c_n\}$$
$$c_i \in \mathbb{N}$$

The sum of these integers is a known constant $k$:
$$\sum_{i=1}^{n} c_i = k$$

Hence, the mean of the sequence is:

$$\bar{c} = \frac{\sum_{i=1}^{n} c_i}{n} = \frac{k}{n}$$

Now, the first $m$ numbers are summed to give $x$:
$$\sum_{i=1}^{m} c_i = x$$

and this sum should be approximately equal to $m$ times the mean:

$$\frac{\sum_{i=1}^{m} c_i}{m} = \frac{x}{m} \approx \bar{c} \implies x \approx m\bar c$$

How can we estimate $x$ by a constant times this product, such that the estimate is as close to the actual value as possible while falling below it with probability at least $p$?
$$Pr(\alpha m \bar c<x)\geq p$$
$$\alpha \in [0, 1]$$

My first thought is that if $\alpha = 1$, this will be an overestimate with probability $p = 0.5$.

Only an approximate answer is required; no closed-form answer is needed.

Any help is greatly appreciated!

Best Answer

Since you're happy with a normal approximation, you can replace the $c_i$ by i.i.d. continuous Gaussian random variables with mean $\mu$ and variance $\sigma^2$ determined from the PMF of the $c_i$. Then you're looking for the conditional probability distribution of the sum of the first $m$ of these variables, given that all $n$ of them sum to $k$. Let $S_j$ denote the sum of the first $j$ variables. Then

\begin{eqnarray*} f_{S_m}(S_m\mid S_n=k) &=&\frac{f_{S_m,S_n}(S_m,k)}{f_{S_n}(k)} \\ &=& \frac{f_{S_m,S_{n-m}}(S_m,k-S_m)}{f_{S_n}(k)} \\ &=& \frac{f_{S_m}(S_m)f_{S_{n-m}}(k-S_m)}{f_{S_n}(k)} \\ &=& \frac1{\sqrt{2\pi\sigma^2}}\sqrt{\frac n{m(n-m)}}\exp\left(-\frac{(S_m-m\mu)^2}{2m\sigma^2}-\frac{(k-S_m-(n-m)\mu)^2}{2(n-m)\sigma^2}+\frac{(k-n\mu)^2}{2n\sigma^2}\right) \\ &=& \frac1{\sqrt{2\pi\sigma'^2}}\exp\left(-\frac{(S_m-\mu')^2}{2\sigma'^2}\right), \end{eqnarray*}

again a normal distribution, with mean

$$ \mu'=\frac mnk $$

(as you expected) and variance

$$ \sigma'^2=\frac1{\frac1m+\frac1{n-m}}\sigma^2=\frac{m(n-m)}n\sigma^2\;, $$

which is zero for $m=0$ or $m=n$ (since in these cases we know the sum to be $0$ and $k$, respectively) and takes its maximal value $\frac n4\sigma^2$ for $m=\frac n2$, i.e. half of the variance $\frac n2\sigma^2$ of the unconditional distribution of $S_{\frac n2}$.
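As a sanity check on these formulas (not part of the derivation), one can simulate integer sequences from a concrete pmf — here, as an assumed example, the uniform pmf on $\{1,\dots,5\}$, which has $\mu=3$ and $\sigma^2=2$ — condition on the exact total $S_n=k$, and compare the empirical mean and variance of $S_m$ with $\mu'$ and $\sigma'^2$:

```python
import random

# Assumed example pmf: uniform on {1,...,5}, so mu = 3 and sigma^2 = 2.
n, m, k = 20, 10, 60           # k = n*mu, the most likely total
var = 2.0

random.seed(0)
samples = []                   # values of S_m among sequences with S_n = k
for _ in range(200_000):
    c = [random.randint(1, 5) for _ in range(n)]
    if sum(c) == k:            # condition on the exact total
        samples.append(sum(c[:m]))

mean = sum(samples) / len(samples)
variance = sum((s - mean) ** 2 for s in samples) / len(samples)

print(mean, variance)          # empirical conditional mean and variance
print(m * k / n)               # prediction mu'      = 30
print(m * (n - m) / n * var)   # prediction sigma'^2 = 10
```

Since the $c_i$ here are discrete rather than Gaussian, the empirical values only approximately match the predictions, but for these parameters they land close to $\mu'=30$ and $\sigma'^2=10$.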

Your constant $\alpha$ is then determined from

$$ P(S_m\gt\alpha\mu')=\frac12\left(1-\operatorname{erf}\left(\frac{\alpha\mu'-\mu'}{\sigma'\sqrt2}\right)\right)=\frac12\left(1-\operatorname{erf}\left((\alpha-1)\frac k\sigma\sqrt{\frac m{2n(n-m)}}\right)\right)\;; $$

setting this equal to $p$ and solving for $\alpha$ yields $\alpha$ in terms of the inverse error function $\operatorname{erf}^{-1}$:

$$ \alpha=1+\frac\sigma k\sqrt{\frac{2n(n-m)}m}\operatorname{erf}^{-1}(1-2p)\;. $$

As you expected, $\alpha=1$ for $p=\frac12$; for $p\gt\frac12$ the inverse error function is negative, so $\alpha\lt1$, giving the desired underestimate.
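A short numerical sketch of this final formula, with assumed example values for $n$, $m$, $k$ and $\sigma$ (here the pmf's standard deviation). Since $\operatorname{erf}^{-1}$ isn't in Python's standard library, it is inverted by bisection on `math.erf`:

```python
import math

def erfinv(y, lo=-6.0, hi=6.0):
    """Invert math.erf by bisection (math.erf is monotone increasing)."""
    for _ in range(80):
        mid = (lo + hi) / 2
        if math.erf(mid) < y:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def alpha(p, n, m, k, sigma):
    """alpha = 1 + (sigma/k) * sqrt(2n(n-m)/m) * erfinv(1 - 2p)."""
    return 1 + sigma / k * math.sqrt(2 * n * (n - m) / m) * erfinv(1 - 2 * p)

# Assumed example values: n = 100, m = 50, k = 300, sigma = sqrt(2)
print(alpha(0.5, 100, 50, 300, math.sqrt(2)))   # approximately 1, as expected
print(alpha(0.9, 100, 50, 300, math.sqrt(2)))   # below 1: a slight underestimate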
