Tail probability of a sum of (possibly dependent) Bernoulli variables

bernoulli-distribution, distribution-tails, poisson-distribution, probability, probability-distributions

We have a series of $n$ Bernoulli random variables,
$X_1, X_2, \ldots, X_n$, where
$X_i \sim \operatorname{Bernoulli}(p_i), \forall i \in [n]$.
Let $Y = \sum_{i = 1}^n X_i$ be the random variable that is the sum of these $n$ random variables.

When these random variables are independent, it is well known that $Y$ follows a Poisson binomial distribution, and we have a Chernoff bound. See, e.g., https://en.wikipedia.org/wiki/Poisson_binomial_distribution.
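To make the independent case concrete, here is a minimal sketch of the generic Chernoff bound $\Pr[Y \geq k] \leq \min_{\lambda > 0} e^{-\lambda k} \prod_{i=1}^n \mathbb{E}[e^{\lambda X_i}]$ for a Poisson binomial sum, where $\mathbb{E}[e^{\lambda X_i}] = 1 - p_i + p_i e^{\lambda}$. The function name and the grid search over $\lambda$ are illustrative choices, not a standard API:

```python
import math

def chernoff_upper_tail(p, k, grid=2000, lam_max=10.0):
    """Chernoff bound for Pr[Y >= k], Y = sum of independent Bernoulli(p_i):
    Pr[Y >= k] <= min over lam > 0 of exp(-lam*k) * prod_i (1 - p_i + p_i*e^lam).
    The minimization over lam is done by a simple grid search."""
    best = 1.0
    for j in range(1, grid + 1):
        lam = lam_max * j / grid
        log_mgf = sum(math.log(1 - pi + pi * math.exp(lam)) for pi in p)
        best = min(best, math.exp(-lam * k + log_mgf))
    return best

p = [0.1, 0.3, 0.5, 0.2]              # E[Y] = 1.1
bound = chernoff_upper_tail(p, k=4)   # here the exact value is prod(p) = 0.003
```

For $k = n$ the bound approaches the exact probability $\prod_i p_i$ as $\lambda \to \infty$, so the grid search recovers something close to $0.003$ here, much tighter than the Markov bound $1.1/4 = 0.275$.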

But when the random variables are dependent, the problem becomes harder.
Here I want an upper bound on the tail probability
$\Pr[Y \geq k]$, for some $k$ with $\mathbb{E}[Y] \leq k \leq n$.

For a special case, let us consider $n + 1$ independent categorical random variables $R_0, R_1, \ldots, R_n$ taking values in $[c]$ for some $c \in \mathbb{N}$, where $\Pr[R_i = t] = q_{it}$ for all $i \in [n] \cup \{0\}$ and $t \in [c]$.
Now we let $X_i$ be the indicator random variable of the event that $R_0$ and $R_i$ are in the same category.
In this special case, $Y$ counts the number of indices $i \in [n]$ such that $R_i$ falls in the same category as $R_0$, and the tail probability $\Pr[Y \geq k]$ measures how likely it is that this count is "high".

Any ideas and help would be highly appreciated.


There are some known results, such as Markov's inequality
$$\Pr[Y \geq k] \leq \frac{\mathbb{E}[Y]}{k}$$
and some generalized results (see, e.g., Linial, Nathan, and Zur Luria, "Chernoff's Inequality - A very elementary proof," arXiv:1403.7739, 2014).

Best Answer

For your special case, we can bound the required probability as follows. First, by the law of total probability,
\begin{align*} \Pr(Y \geq k) = \sum_{t = 1}^c \Pr(Y \geq k \mid R_0 = t)\Pr(R_0 = t). \end{align*}
The conditional probability is easier to handle because the dependence runs only through $R_0$. Define $Z_i$ to be the indicator variable of the event that $R_i$ takes the particular value $t$. Conditional on $\{R_0 = t\}$, the $Z_i$'s are independent with $Z_i \sim \operatorname{Bernoulli}(q_{it})$, and
\begin{align*} \Pr(Y \geq k \mid R_0 = t) = \Pr\left( \sum_{i = 1}^n Z_i \geq k \right). \end{align*}
Since the $Z_i$'s are independent, you can apply the standard concentration inequality of your choice to each conditional term. Obtain your final result by plugging these bounds back into the first equation.
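The conditioning argument can be sketched in code. This is a hand-rolled illustration (the function names `hoeffding_upper_tail` and `tail_bound` are my own, and categories are 0-indexed here): each conditional tail is bounded with Hoeffding's inequality, and the pieces are combined via the law of total probability:

```python
import math

def hoeffding_upper_tail(mean, n, k):
    """Hoeffding: Pr[S >= k] <= exp(-2 (k - E[S])^2 / n) for a sum S of n
    independent [0,1]-valued variables with E[S] = mean (valid for k >= mean)."""
    if k <= mean:
        return 1.0
    return math.exp(-2.0 * (k - mean) ** 2 / n)

def tail_bound(q, k):
    """Pr[Y >= k] <= sum_t Pr[R_0 = t] * bound on Pr[sum_i Z_i >= k | R_0 = t],
    where q[i][t] = Pr[R_i = t] for i = 0..n and t = 0..c-1.
    Conditional on R_0 = t, Z_i ~ Bernoulli(q[i][t]) are independent."""
    n = len(q) - 1
    total = 0.0
    for t, q0t in enumerate(q[0]):
        cond_mean = sum(q[i][t] for i in range(1, n + 1))  # E[sum Z_i | R_0 = t]
        total += q0t * hoeffding_upper_tail(cond_mean, n, k)
    return min(total, 1.0)

# Toy example: n = 3, c = 2, all variables uniform over the two categories.
q = [[0.5, 0.5]] * 4
b = tail_bound(q, k=3)   # each conditional bound is exp(-1.5) ~ 0.223
```

In this toy case the exact probability $\Pr[Y \geq 3] = 2 \cdot (1/2)^4 = 0.125$, so the Hoeffding-based bound of roughly $0.223$ is valid though not tight; any sharper inequality (e.g. Chernoff with the exact conditional MGF) can be substituted in `hoeffding_upper_tail`'s place.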

In general, I am not aware of concentration inequalities that incorporate an arbitrary dependency structure between random variables. When your random variables are sub-Gaussian (as Bernoulli variables are), and in particular when each $X_i$ is $\sigma_i^2$-sub-Gaussian, the sum $\sum_{i = 1}^n X_i$ is $(\sigma_1 + \sigma_2 + \dots + \sigma_n)^2$-sub-Gaussian irrespective of the correlations between the random variables (which can be very loose). When the $X_i$'s are independent, this tightens to $(\sigma_1^2 + \sigma_2^2 + \dots + \sigma_n^2)$-sub-Gaussian.
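A quick numeric comparison shows how much is lost by the dependent-case variance proxy. The sketch below assumes the one-sided sub-Gaussian tail $\Pr[Y - \mathbb{E}[Y] \geq s] \leq \exp(-s^2 / (2\sigma^2))$ and uses $\sigma_i = 1/2$, which by Hoeffding's lemma works for any $[0,1]$-valued variable:

```python
import math

def subgaussian_tail(s, sigma2):
    """One-sided sub-Gaussian tail: Pr[Y - E[Y] >= s] <= exp(-s^2 / (2*sigma2))."""
    return math.exp(-s * s / (2.0 * sigma2))

n = 100
sigma_i = 0.5   # Hoeffding's lemma: any [0,1]-valued variable is (1/4)-sub-Gaussian
s = 20.0        # deviation above the mean

dep = subgaussian_tail(s, (n * sigma_i) ** 2)   # dependent case: (sum sigma_i)^2
ind = subgaussian_tail(s, n * sigma_i ** 2)     # independent case: sum sigma_i^2
```

With $n = 100$ and $s = 20$, the dependent bound is $e^{-0.08} \approx 0.92$ (essentially vacuous), while the independent bound is $e^{-8} \approx 3 \times 10^{-4}$, illustrating the gap described above.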
