Probability – Concentration Inequalities for Random Sampling Without Replacement

Tags: co.combinatorics, measure-concentration, pr.probability, probability-distributions, st.statistics

Let a population $C$ consist of $N$ values $c_1, c_2, \cdots, c_N$, with $c_i\in \{0,1\}$. Let $X_1, X_2, \cdots, X_n$ denote a random sample without replacement from $C$ and let $Y_1, Y_2, \cdots, Y_n$ denote a random sample with replacement from $C$. The random variables $Y_1, \cdots, Y_n$ are independent and identically distributed with mean $\mu$ and variance $\sigma^2$, where
$$
\mu=\frac{1}{N} \sum_{i=1}^N c_i, \quad \sigma^2=\frac{1}{N} \sum_{i=1}^N\left(c_i-\mu\right)^2
$$

Chernoff's method gives upper bounds for $\Pr\{\bar{Y}-\mu \geq t\}$, where $\bar{Y}=\left(Y_1+\cdots+Y_n\right) / n$. Hoeffding proved (Theorem 4 of his 1963 paper) that the same bounds are upper bounds for $\Pr\{\bar{X}-\mu \geq t\}$, where $\bar{X}=\left(X_1+\cdots+X_n\right) / n$.
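As a quick sanity check of Hoeffding's claim, the following Python sketch (not part of the original question; the population, $N$, $n$, and $t$ are made-up illustrative values) estimates both tail probabilities and compares them with the additive Chernoff/Hoeffding bound $e^{-2nt^2}$:

```python
import math
import random

random.seed(0)

# Illustrative 0/1 population: N = 1000 values with mu = 0.3
N, n, t = 1000, 100, 0.1
population = [1] * 300 + [0] * 700
mu = sum(population) / N

def tail_prob(with_replacement, trials=20_000):
    """Empirical estimate of Pr{sample mean - mu >= t}."""
    hits = 0
    for _ in range(trials):
        if with_replacement:
            sample = random.choices(population, k=n)  # Y_1, ..., Y_n
        else:
            sample = random.sample(population, n)     # X_1, ..., X_n
        hits += sum(sample) / n - mu >= t
    return hits / trials

hoeffding = math.exp(-2 * n * t ** 2)  # additive Chernoff/Hoeffding bound
print(tail_prob(True), tail_prob(False), hoeffding)
```

Both empirical tails should come out well below the bound, with the without-replacement tail typically the smaller of the two.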

Question

In the case above we have $E \bar{X}=E \bar{Y}=\mu$ but $\operatorname{Var} \bar{X}=\frac{N-n}{N-1} \frac{\sigma^2}{n}<\frac{\sigma^2}{n}=\operatorname{Var} \bar{Y}$.

Doesn't variance already show how concentrated the random variables are around the mean? Why isn't it obvious that the concentration inequalities for $\bar{Y}$ would also hold for $\bar{X}$?

Now imagine that instead of $n$ samples with replacement we take $n'$ samples with replacement, for some $n'>n$, while still taking $n$ samples without replacement. Then $E \bar{X}=E \bar{Y}=\mu$, $\operatorname{Var} \bar{X}=\frac{N-n}{N-1} \frac{\sigma^2}{n}$, and $\operatorname{Var} \bar{Y}=\frac{\sigma^2}{n'}$. If we choose the largest $n'$ for which $\operatorname{Var} \bar{X}<\operatorname{Var} \bar{Y}$ still holds, i.e. $n'$ just below $n\frac{N-1}{N-n}$, the corresponding bound on $\Pr\{\bar{Y}-\mu \geq t\}$ would yield a tighter concentration inequality for $\bar{X}$.
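The boundary value $n' = n(N-1)/(N-n)$ can be checked with exact rational arithmetic; here is a small Python sketch (the numbers are illustrative, not from the post):

```python
from fractions import Fraction

def var_without(N, n, sigma2):
    """Var of the sample mean without replacement (finite-population correction)."""
    return Fraction(N - n, N - 1) * sigma2 / n

def var_with(n_prime, sigma2):
    """Var of the sample mean of n' i.i.d. draws."""
    return sigma2 / n_prime

N, n = 10_000, 100
sigma2 = Fraction(9, 100)               # e.g. a 0/1 population with p = 1/10
n_prime = Fraction(n * (N - 1), N - n)  # boundary n' = n(N-1)/(N-n); here 101
print(n_prime, var_without(N, n, sigma2) == var_with(n_prime, sigma2))
```

At the boundary the two variances coincide exactly; any smaller $n'$ makes $\operatorname{Var}\bar Y$ strictly larger.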

EDIT

The following figure plots the size of the sampled block against the probability that the sample mean deviates from the population mean by a given amount. The dots are simulation results. The blue curve uses the additive Chernoff bound with $n'=n (N-1)/(N-n)$, and the orange curve uses Serfling's bound.

Here's the Mathematica code:

generateRandomList[n_, m_] :=
  RandomSample[Join[ConstantArray[1, m], ConstantArray[0, n - m]]];

fractionGreaterThan[list_, threshold_] :=
  Count[list, x_ /; x >= threshold]/Length[list];

(* KL divergence, used in the Chernoff bound *)
kldFunc[x_, y_] := x Log[x/y] + (1 - x) Log[(1 - x)/(1 - y)];

n = 10^4;   (* size of the population *)
p = 1/10;   (* fraction of 1s in the population *)
e = 10^-2;  (* deviation from the expected value *)
biglist = generateRandomList[n, n p];
Show[{Plot[{Exp[-kldFunc[p + e, p] k ((n - 1)/(n - k))],
     Exp[-2 k/(1 - (k - 1)/n) e^2]}, {k, 1, n},
    PlotLegends -> {"Additive Chernoff - Modified", "Serfling"},
    AxesLabel -> {"Sample size", "Probability"}],
   ListPlot[{Table[{k, fractionGreaterThan[Table[
         smallist = RandomSample[biglist, k];
         Total[smallist]/k, {i, 1000}], p + e]}, {k, 1, n, 100}]},
    PlotRange -> Full, PlotLegends -> {"Exact value"}]}]
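For readers without Mathematica, here is a rough Python port of the same experiment (an unofficial translation; the formulas follow the code above, evaluated at a few sample sizes instead of a full plot):

```python
import math
import random

random.seed(1)

def kld(x, y):
    """Binary KL divergence, as in kldFunc."""
    return x * math.log(x / y) + (1 - x) * math.log((1 - x) / (1 - y))

N = 10_000                 # population size (called n in the Mathematica code)
ones = 1_000               # number of 1s, so p = 0.1
p = ones / N
e = 0.01                   # deviation from the expected value
population = [1] * ones + [0] * (N - ones)

def chernoff_modified(k):
    """Additive Chernoff with the substitution n' = k(N-1)/(N-k)."""
    return math.exp(-kld(p + e, p) * k * (N - 1) / (N - k))

def serfling(k):
    """Serfling's bound for sampling without replacement."""
    return math.exp(-2 * k / (1 - (k - 1) / N) * e ** 2)

def empirical(k, trials=1000):
    """Simulated Pr{mean of k draws without replacement >= p + e}."""
    hits = sum(sum(random.sample(population, k)) / k >= p + e
               for _ in range(trials))
    return hits / trials

for k in (100, 1000, 5000):
    print(k, chernoff_modified(k), serfling(k), empirical(k))
```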

Best Answer

$\newcommand\E{\operatorname{E}}\newcommand\var{\operatorname{Var}}\newcommand\si{\sigma}$This will not work. E.g., if $N=10$, $\{c_1,\dots,c_{10}\}=\{-1, -1, -1, -1, -1, 1, 1, 1, 1, 1\}$, $n=5$, and $n'=9$, then $\E\bar X=\E\bar Y=\mu=0$ and $$\var\bar Y=\frac{\si^2}9=\var\bar X,$$ so the variances compare the right way, whereas $$\E\bar Y^4=\frac{875}{25515}\not\ge\frac{891}{25515}=\E\bar X^4,$$ so the fourth moments do not.
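The two fourth moments in the counterexample can be verified exactly in Python (a sketch, not the original notebook; the with-replacement moment uses the standard Rademacher formula $\operatorname{E} S^4 = 3m^2-2m$ for a sum $S$ of $m$ i.i.d. $\pm 1$ variables):

```python
from fractions import Fraction
from itertools import combinations

pop = [-1] * 5 + [1] * 5
N = 10

# E[Xbar^4] for n = 5 draws without replacement:
# average over all C(10, 5) = 252 equally likely 5-subsets
n = 5
subsets = list(combinations(range(N), n))
EX4 = sum(Fraction(sum(pop[i] for i in s), n) ** 4
          for s in subsets) / len(subsets)

# E[Ybar^4] for n' = 9 i.i.d. uniform draws from pop (i.e. Rademacher):
# E[S^4] = 3m^2 - 2m for a sum S of m i.i.d. +/-1 variables
m = 9
EY4 = Fraction(3 * m * m - 2 * m, m ** 4)

print(EX4, EY4, EY4 < EX4)
```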


The calculations are shown in the image of a Mathematica notebook below:

[Image: a Mathematica notebook with the calculations]
