[Math] Azuma’s inequality to McDiarmid’s inequality

Tags: inequality, martingales, measure-theory, probability, probability-theory

I was going through some notes on concentration inequalities when I noticed that there are two commonly-cited forms of McDiarmid's inequality. Long story short: I know how to prove the weaker one from Azuma's inequality. I also know how to prove the stronger one not directly from Azuma. But is there a way to prove the strong one from Azuma's inequality?

In what follows I will assume for simplicity that all differences are bounded by 1.

Azuma's inequality: Let $S_n = \sum_{i=1}^n X_i$ and $S_0 = 0$. Let the filtration $\{\mathcal{F}_n\}$ be the usual one defined by $\mathcal{F}_n = \sigma(X_1, \ldots, X_n)$. If $\{S_n\}_{n \geq 0}$ is a martingale and $|X_n| \leq 1$ for all $n$, then for $\lambda>0$, $$P(S_n \geq \lambda) \leq \exp \left(-\frac{\lambda^2}{2n} \right)$$
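For a concrete instance (my own illustrative example, not part of the statement): if the $X_i$ are independent and uniform on $\{-1,+1\}$, then $\{S_n\}$ is a martingale with $|X_n| \leq 1$, and taking $\lambda = t\sqrt{n}$ the bound reads $$P(S_n \geq t\sqrt{n}) \leq \exp\left(-\frac{t^2}{2}\right),$$ i.e. deviations of order $\sqrt{n}$ are exponentially unlikely.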

Now here is the "weak" version of McDiarmid's which is an obvious application of Azuma's inequality:

McDiarmid's inequality: Suppose $\xi_1, \ldots, \xi_n$ are independent. Let $Z = f(\xi_1, \ldots, \xi_n)$. Also, assume that: $$
\Big|\; f(\xi_1, \ldots, \xi_k, \ldots, \xi_n) - f(\xi_1, \ldots, \xi_k', \ldots, \xi_n) \;\Big| \leq 1$$ for every $k$ and any two values $\xi_k$ and $\xi_k'$ of the $k$-th argument. Then for $\lambda>0$, $$
P(Z-EZ \geq \lambda) \leq \exp \left(-\frac{\lambda^2}{2n} \right) $$
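As a sanity check (again my own illustrative example): take $f(\xi_1, \ldots, \xi_n) = \sum_{i=1}^n \xi_i$ with each $\xi_i \in [0,1]$. Changing a single coordinate changes $f$ by at most $1$, so the bound becomes $$P\left(\sum_{i=1}^n \xi_i - E\sum_{i=1}^n \xi_i \geq \lambda\right) \leq \exp\left(-\frac{\lambda^2}{2n}\right),$$ which is a weakened form of Hoeffding's inequality; the strong version below recovers the usual $\exp(-2\lambda^2/n)$.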

Now the strong version of McDiarmid's inequality has the same setup except that the "2" in the denominator is moved up to the numerator, so the bound is:
$$P(Z-EZ \geq \lambda) \leq \exp \left(-\frac{2\lambda^2}{n} \right)$$
Clearly this bound is sharper than the one in the version presented above. I know a proof for this bound but it does not explicitly use Azuma's inequality above.

The question: Is it possible to adapt Azuma's inequality to prove this second bound? More specifically, is there a more general form of Azuma's inequality that can be applied to the case of McDiarmid's inequality to yield the tighter bound?

Edit: After pondering it over, I now think that it is not possible. Just intuitively, the bounded martingale difference condition in Azuma's inequality seems somehow "weaker" than the condition in McDiarmid's inequality, which is a Lipschitz-like condition on the function $f$. So one would expect that using Azuma's inequality would not give the sharpest bound.

Edit 2: To shed some more light on this, the proof of the "weak" McDiarmid's using Azuma's inequality goes like this. Define $S_m = E(Z \mid \xi_1, \ldots, \xi_m) - EZ$ for $m \geq 1$, with $S_0 = 0$. Then $\{S_m\}_{m \geq 0}$ is a martingale, and for $m=n$ we have $S_n = Z - EZ$.

To prove that the martingale differences are bounded, let $\xi_1', \ldots, \xi_n'$ be independent copies of $\xi_1, \ldots, \xi_n$. Then we have:
$$ E\big[f(\xi_1, \ldots, \xi_m, \ldots, \xi_n) \;\big|\; \xi_1, \ldots, \xi_{m-1}\big] = E\big[f(\xi_1, \ldots, \xi_m', \ldots, \xi_n) \;\big|\; \xi_1, \ldots, \xi_m\big] $$
Thus we have:
$$\big|S_m - S_{m-1}\big| \leq E\big[|\, f(\xi_1, \ldots, \xi_m, \ldots, \xi_n) - f(\xi_1, \ldots, \xi_m', \ldots, \xi_n)\,| \;\big|\; \xi_1, \ldots, \xi_m\big]$$
So to satisfy the bounded martingale difference condition in Azuma's inequality, we only need the conditional expectation (given $\xi_1, \ldots, \xi_m$) of the difference $f(\cdot) - f(\cdot)$ to be bounded, whereas the assumption in McDiarmid's is slightly stronger. However, the bounded difference condition must also hold for all $m$, and in particular when conditioning on $\xi_1, \ldots, \xi_n$; in that case it seems to essentially reduce to the assumption in McDiarmid's inequality. So it does seem like Azuma's inequality (in this form) is not strong enough. Thoughts?
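(To finish the sketch, filling in the step left implicit above: by the bounded-difference assumption, the conditional expectation on the right-hand side is at most $1$, so $|S_m - S_{m-1}| \leq 1$ for every $m$, and applying Azuma's inequality to the martingale $\{S_m\}$ gives $P(Z - EZ \geq \lambda) = P(S_n \geq \lambda) \leq \exp(-\lambda^2/(2n))$, i.e. the weak bound.)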

Here is a resource presenting the "standard" proof of the strong version of McDiarmid's inequality: http://empslocal.ex.ac.uk/people/staff/yy267/McDiarmid.pdf

Best Answer

The difference comes from the sharper constant for the sub-Gaussian norm/parameter of bounded random variables. You can still use Azuma's technique to get the sharper bound. Here are some notes I wrote a while ago:

Sub-Gaussian tail bounds

  • Recall that $\mathbb{E} e^{\lambda X} \le e^{\sigma^2 \lambda^2/2 }$ implies the following tail bound (assuming $\mathbb{E} X = 0$) \begin{align*} \mathbb{P}( X \ge t) \le \exp\Big( {- \frac{t^2}{2\sigma^2}}\Big) \end{align*} Such a random variable is called sub-Gaussian with parameter $\sigma$. (The Chernoff computation behind this implication is sketched just after this list.)

  • A bounded variable $X \in [a,b]$ has squared sub-Gaussian parameter $\sigma^2 = (b-a)^2/4$. This sharp bound (Hoeffding's lemma) requires some work to show (left as an exercise).
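For completeness, here is the standard Chernoff computation behind the first bullet (nothing beyond Markov's inequality is used): for any $\lambda \ge 0$, \begin{align*} \mathbb{P}(X \ge t) = \mathbb{P}(e^{\lambda X} \ge e^{\lambda t}) \le e^{-\lambda t}\, \mathbb{E} e^{\lambda X} \le \exp\Big( \frac{\sigma^2 \lambda^2}{2} - \lambda t \Big), \end{align*} and optimizing over $\lambda$ (take $\lambda = t/\sigma^2$) gives the stated bound $\exp(-t^2/(2\sigma^2))$.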

Azuma-Hoeffding approach

We want to provide concentration bounds for $Z = f(X) = f(X_1,\dots,X_n)$.

Let $\mathbb{E}_i[Z] := \mathbb{E}[Z \mid X_1,\dots,X_i]$ and $\Delta_i = \mathbb{E}_i[Z] - \mathbb{E}_{i-1}[Z]$. Then, $\{\Delta_i\}$ is a martingale difference sequence: $\mathbb{E}_{i-1}[\Delta_i] = 0$. Let $S_j := \sum_{i=1}^j \Delta_i$, which is a function of $X_1,\dots,X_j$ only, and note that \begin{align*} S_n = \sum_{i=1}^n \Delta_i = Z - \mathbb{E} Z. \end{align*}

Let us assume that (*) $\mathbb{E}_{i-1} [e^{\lambda \Delta_i}] \le e^{\sigma_i^2 \lambda^2/2}$ for all $\lambda \ge 0$ and all $i$, almost surely. Then, \begin{align*} \mathbb{E}_{n-1} [ e^{\lambda S_n}] = e^{\lambda S_{n-1}}\mathbb{E}_{n-1} [ e^{\lambda \Delta_n}] \le e^{\lambda S_{n-1}} e^{\sigma_n^2 \lambda^2/2}. \end{align*} Taking $\mathbb{E}_{n-2}$ of both sides: \begin{align*} \mathbb{E}_{n-2} [ e^{\lambda S_n}] \le e^{\sigma_n^2 \lambda^2/2} \mathbb{E}_{n-2}[ e^{\lambda S_{n-1}} ] \le e^{\lambda S_{n-2}} e^{(\sigma_n^2 + \sigma_{n-1}^2)\lambda^2/2} \end{align*} Repeating the process, we get $\mathbb{E}_{0} [e^{\lambda S_n}] \le \exp \big( \big(\sum_{i=1}^n \sigma_i^2\big)\lambda^2/2 \big)$, which shows that $S_n$ is sub-Gaussian with squared parameter $\sigma^2 = \sum_{i=1}^n \sigma_i^2$.
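Combining this with the tail bound from the first bullet above (spelled out here for completeness), and noting that $\mathbb{E}_0$ is just the unconditional expectation and $\mathbb{E} S_n = 0$, we obtain for $t \ge 0$ \begin{align*} \mathbb{P}(Z - \mathbb{E} Z \ge t) = \mathbb{P}(S_n \ge t) \le \exp\Big( -\frac{t^2}{2\sum_{i=1}^n \sigma_i^2} \Big). \end{align*}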

Bounded difference

The conditional sub-Gaussian assumption (*) holds under the bounded difference condition \begin{align*} |f(x_1,\dots,x_{i-1},x_i,x_{i+1},\dots,x_n) - f(x_1,\dots,x_{i-1},x_i',x_{i+1},\dots,x_n)| \le L_i. \end{align*} Let $g_i(x_1,\dots,x_i) := \mathbb{E}[f(x_1,\dots,x_{i-1},x_i,X_{i+1},\dots,X_n) ]$ so that $\mathbb{E}_i[Z] = g_i(X_1,\dots,X_i)$. This uses independence, i.e. the distribution of the remaining variables does not change under conditioning. It is easy to see that $g_i$ satisfies the bounded difference condition with constants $(L_1,\dots,L_i)$, by an application of Jensen's inequality.
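Spelling out the Jensen step for the last coordinate (the other coordinates are handled identically): \begin{align*} \big|g_i(x_1,\dots,x_{i-1},x_i) - g_i(x_1,\dots,x_{i-1},x_i')\big| &= \Big| \mathbb{E}\big[ f(x_1,\dots,x_i,X_{i+1},\dots,X_n) - f(x_1,\dots,x_i',X_{i+1},\dots,X_n) \big] \Big| \\ &\le \mathbb{E}\big| f(x_1,\dots,x_i,X_{i+1},\dots,X_n) - f(x_1,\dots,x_i',X_{i+1},\dots,X_n) \big| \le L_i. \end{align*}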

  1. Naive bound: We have \begin{align*} \Delta_i = \mathbb{E}_i[Z] - \mathbb{E}_{i-1}[\mathbb{E}_i[Z]] = g_i(X_1,\dots,X_i) - \mathbb{E}_{i-1}[g_i(X_1,\dots,X_i)] \end{align*} Conditioned on $X_1,\dots,X_{i-1}$, we are effectively looking at ($X'_i$ is an independent copy of $X_i$) \begin{align*} g_i(x_1,\dots,x_{i-1},X_i)- \mathbb{E}[g_i(x_1,\dots,x_{i-1},X'_i)] \end{align*} due to independence of $\{X_i\}$. Thus, $|\Delta_i| \le L_i$ conditioned on $X_1,\dots,X_{i-1}$. That is, $\mathbb{E}_{i-1} [e^{\lambda \Delta_i}] \le e^{\sigma_i^2 \lambda^2/2}$ where $\sigma_i^2 = (2L_i)^2/4 = L_i^2$.

  2. Better bound: We can show that $\Delta_i \in I_i$ where $|I_i| \le L_i$, improving the constant by a factor of $4$: Conditioned on $X_1,\dots, X_{i-1}$, we are effectively looking at \begin{align*} \Delta_i = g_i(x_1,\dots,x_{i-1},X_i) - \mu_i \end{align*} where $\mu_i$ is a constant (only a function of $x_1,\dots,x_{i-1}$). Then, $\Delta_i + \mu_i \in [a_i,b_i]$ where $a_i = \inf_x g_i(x_1,\dots,x_{i-1},x)$ and $b_i = \sup_x g_i(x_1,\dots,x_{i-1},x)$. We have \begin{align*} b_i - a_i = \sup_{x,y} \big[ g_i(x_1,\dots,x_{i-1},x) - g_i(x_1,\dots,x_{i-1},y) \big] \le L_i \end{align*} Thus, we have $\mathbb{E}_{i-1}[e^{\lambda \Delta_i}] \le e^{\sigma_i^2 \lambda^2/2}$ where $\sigma_i^2 = (b_i-a_i)^2/4 \le L_i^2/4$. (The pieces are put together right after this list.)
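Putting the pieces together (a summary of what the two estimates yield, using the general bound derived above): with $\sigma_i^2 \le L_i^2/4$ from the better bound, \begin{align*} \mathbb{P}(Z - \mathbb{E} Z \ge t) \le \exp\Big( -\frac{2t^2}{\sum_{i=1}^n L_i^2} \Big), \end{align*} which is the strong form of McDiarmid's inequality; with $L_i = 1$ for all $i$ this is $\exp(-2t^2/n)$. The naive estimate $\sigma_i^2 \le L_i^2$ only gives $\exp\big(-t^2/(2\sum_i L_i^2)\big)$, i.e. the weak form, so the factor of $4$ in the exponent comes entirely from bounding $\Delta_i$ in an interval of length $L_i$ rather than $2L_i$.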
