Is the sum of low-probability dependent Bernoulli variables close to zero with high probability?

Tags: probability, probability-distributions, summation

Assume $X_1,\dots,X_n$ are (possibly) dependent Bernoulli random variables, $X_i\sim \mathrm{Ber}(\epsilon)$ with $P(X_i=1) = 1-P(X_i=0) = \epsilon$ for all $1\leq i\leq n$, and denote $Y_n = \sum_{i=1}^n X_i$.

Question. Is it true that for any $\epsilon>0$, there exists $n_0 \in \mathbb{N}$ such that for all $n \geq n_0$ and for all random variables $X_1,\dots,X_n$ with marginal distributions $X_i \sim \mathrm{Ber}(\epsilon)$, we have
$$ P\left(Y_n\leq 2 \epsilon n\right) \geq 1-\epsilon? $$

If the above question is answered with "no", can we choose a different (larger) constant $c$ instead of $2$ to make this statement correct?

Here is my thought process: for independent variables the statement is directly correct by, e.g., the weak law of large numbers. It also holds for fully dependent variables, i.e., $X_1=\dots=X_n$: then $Y_n$ is either $0$ or $n$, so (for $\epsilon<1/2$, where $2\epsilon n<n$) $P(Y_n\leq 2\epsilon n) = P(X_1=\dots=X_n=0) = 1-\epsilon$. I am unsure, however, about other possible dependencies.
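For what it's worth, here is a quick simulation sketch of these two extremes (Python; the parameters $\epsilon=0.1$, $n=1000$ are my own illustrative choices, not from the question):

```python
import numpy as np

# Illustrative parameters (not from the question): eps = 0.1, n = 1000.
rng = np.random.default_rng(0)
eps, n, trials = 0.1, 1000, 10_000

# Independent case: Y_n ~ Binomial(n, eps) concentrates near eps*n,
# so P(Y_n <= 2*eps*n) is essentially 1 for large n (weak law).
y_indep = rng.binomial(n, eps, size=trials)
print(np.mean(y_indep <= 2 * eps * n))  # ~1.0

# Fully dependent case X_1 = ... = X_n: Y_n is 0 or n, so
# P(Y_n <= 2*eps*n) = P(all zeros) = 1 - eps (since 2*eps*n < n here).
y_dep = n * (rng.random(trials) < eps).astype(int)
print(np.mean(y_dep <= 2 * eps * n))    # ~0.9 = 1 - eps
```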

I would be glad for any hints or references to the literature.

Best Answer

I am still confused about the intended quantifier order, but let me rewrite the question as I think it is meant.

Question. Is the following claim true? For all $\epsilon>0$, there exists $n_0$ such that for all $n>n_0$ and for all $\text{Bernoulli}(\epsilon)$ random variables $X_1,\ldots,X_n$ we have $P(\sum_i X_i \le 2\epsilon n) \ge 1-\epsilon$.

Assuming this is the intended meaning, the answer is no. A single counterexample suffices to prove the claim false, and its parameters can be adjusted to refute similar claims (e.g. with the $2$ replaced by some other constant).

Let $\epsilon=0.01$ and $n=100k$, where $k$ is any positive integer. Let the $X_i$ be dependent as follows: with probability $1/4$, a uniformly random subset of $4k$ of them are ones and the other $96k$ are zeros; with probability $3/4$, all $X_i$ are zeros. Clearly for any $i$ we have $P(X_i=1) = (1/4)(4k/100k) = 0.01 = \epsilon$, as required.

But with probability $1/4$ we have $Y_n = \sum_i X_i = 4k$, while $2\epsilon n = 2k$. So $P(Y_n \le 2\epsilon n) = P(Y_n \le 2k) = 3/4$, much less than $1-\epsilon=0.99$.
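To make the construction concrete, here is a small Monte Carlo sketch of this counterexample (Python; $k=10$ and the trial count are my own illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
k, trials = 10, 200_000
n, eps = 100 * k, 0.01

# With probability 1/4 a uniformly random subset of 4k indices is one,
# otherwise everything is zero; so Y_n is 4k w.p. 1/4 and 0 w.p. 3/4.
y = np.where(rng.random(trials) < 0.25, 4 * k, 0)

# Marginal check: by symmetry of the random subset, P(X_i = 1) = E[Y_n]/n.
print(y.mean() / n)               # ~0.01 = eps
print(np.mean(y <= 2 * eps * n))  # ~0.75, far below 1 - eps = 0.99
```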

Weaker claim (This time it is true)

But the title of the question asks for something much weaker: is the sum close to zero with high probability? Yes, this we can prove; we just cannot be too greedy about both things, "close" and "high", at the same time. If we are more moderate about both, it works.

Claim. Let $\epsilon>0$ and $n \in \mathbb{Z}_+$ be arbitrary. Let $X_1,\ldots,X_n$ be (possibly dependent) $\text{Bernoulli}(\epsilon)$ variables and $Y_n=\sum_{i=1}^n X_i$. Then for any $c>0$ we have $$ P(Y_n \le cn\epsilon) \ge 1-1/c. $$

Proof. $Y_n$ is a nonnegative random variable with $\mathbb{E}Y_n = n\epsilon$ by linearity of expectation, regardless of the dependence. Markov's inequality then gives $P(Y_n > cn\epsilon) \le \mathbb{E}Y_n/(cn\epsilon) = 1/c$, and the claim follows.
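As a quick sanity check (my own sketch, not part of the original answer), one can evaluate this bound exactly on the counterexample family above, where $Y_n$ is $0$ with probability $3/4$ and $4k$ with probability $1/4$:

```python
import numpy as np

k = 10
n, eps = 100 * k, 0.01
# Exact law of Y_n for the counterexample family above.
support = np.array([0, 4 * k])
probs = np.array([0.75, 0.25])

for c in (1.5, 2.0, 4.0, 10.0):
    lhs = probs[support <= c * n * eps].sum()  # P(Y_n <= c*n*eps)
    print(f"c={c}: {lhs:.2f} >= {1 - 1/c:.2f}: {lhs >= 1 - 1/c}")
```

Markov's guarantee holds for every $c$, even for this adversarial dependence structure.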

For example, if we take $c=1/\sqrt{\epsilon}$, then we have something reminiscent of the original question: $$ P\left(Y_n \le \sqrt{\epsilon}\,n\right) \ge 1-\sqrt{\epsilon}. $$

So you will have $Y_n$ "close to zero" (at most a $\sqrt{\epsilon}$ fraction of $n$) "with high probability" (at least $1-\sqrt{\epsilon}$) after all. Just don't be too greedy about it.
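And as a closing check (again my own sketch), this relaxed bound with $c=1/\sqrt{\epsilon}$ holds comfortably on the very counterexample that defeated the original claim:

```python
import numpy as np

k = 10
n, eps = 100 * k, 0.01
support = np.array([0, 4 * k])  # law of Y_n in the counterexample
probs = np.array([0.75, 0.25])

threshold = np.sqrt(eps) * n    # sqrt(eps)*n = 100, vs. 2*eps*n = 20
print(probs[support <= threshold].sum())  # 1.0 >= 1 - sqrt(eps) = 0.9
```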