Using Hoeffding inequality for risk / loss function

Tags: inequalities, machine learning, pr.probability, probability distributions, st.statistics

I've got a question about the Hoeffding inequality, which states that for data points $X_1, \dots, X_n \in X$ that are i.i.d. according to a probability measure $P$ on $X$, we have an upper bound of the form:

$$P\left(\left|\sum_{i=1}^n X_i - \int_X x\,dP(x)\right| \ge \varepsilon\right) \le \alpha.$$

In machine learning proofs, this is often applied to functions of the data, so for a function $f: X \rightarrow \mathbb{R}$ we get a bound:

$$P\left(\left|\sum_{i=1}^n f(X_i) - \int_X f(x)\,dP(x)\right| \ge \varepsilon\right) \le \alpha.$$

I don't understand why we are able to obtain this bound with the same probability measure $P$. Why don't we need to look at a different / transformed probability measure $P_f$?

Best Answer

I think this question is mostly about definitions. Often, people don't distinguish between $\mathbb{P}$, the probability measure on the underlying probability space, and $P$, the distribution, i.e. the probability measure induced by a random variable. People also tend to use $P$ directly, without making any explicit reference to the original probability space. This can cause confusion.

The first inequality may be restated more precisely as follows. Let $X_i: \Omega \to \mathcal{X} \subseteq \mathbb{R}$ be a Borel measurable map from the probability space $(\Omega, \mathcal{F}, \mathbb{P})$, for each $i=1,\dots,n$, and let the $X_i$ have a common distribution $P$. Then the first inequality should be $$ \mathbb{P}\Big\{ \omega \in \Omega : \Big| \sum_{i=1}^n X_i(\omega) - \int_{\mathcal{X}} x P(dx) \Big| > \varepsilon \Big\} \leq \alpha, $$ which can be written as $$ \mathbb{P}\Big\{ \Big| \sum_{i=1}^n X_i - \int_{\mathcal{X}} x P(dx) \Big| > \varepsilon \Big\} \leq \alpha. $$ Notice that $\mathbb{P}$ should be used instead of the distribution $P$. In fact, the first inequality in your post doesn't really make sense when you read it rigorously: since $P$ is a probability measure on $\mathcal{X}$, the argument of $P$ should be a subset of $\mathcal{X}$, not a subset of $\Omega$.
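This can be checked numerically. The sketch below uses the standard averaged form of Hoeffding's inequality for $[0,1]$-valued variables, $\mathbb{P}\{|\frac{1}{n}\sum_i X_i - \mathbb{E}X_1| \ge \varepsilon\} \le 2e^{-2n\varepsilon^2}$, with Bernoulli$(1/2)$ samples; the choices of $n$, $\varepsilon$, and the distribution are illustrative assumptions, and the $\alpha$ above plays the role of the explicit constant $2e^{-2n\varepsilon^2}$:

```python
import math
import random

random.seed(0)

n = 100          # sample size per experiment
trials = 20000   # number of repeated experiments (draws of omega)
eps = 0.1

# X_i ~ Bernoulli(1/2), so X_i takes values in [0, 1] and E[X_1] = 1/2.
mean_true = 0.5

exceed = 0
for _ in range(trials):
    xs = [random.random() < 0.5 for _ in range(n)]
    sample_mean = sum(xs) / n
    # The event {|sample mean - E[X_1]| >= eps} is a subset of Omega,
    # measured by the underlying probability (here: the simulator).
    if abs(sample_mean - mean_true) >= eps:
        exceed += 1

empirical = exceed / trials
# Hoeffding bound for [0,1]-valued variables in the averaged form.
bound = 2 * math.exp(-2 * n * eps**2)
```

Here `empirical` estimates the left-hand side of the inequality and comes out well below `bound`, as Hoeffding guarantees.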

The second inequality should be of the form $$ \mathbb{P}\Big\{ \omega \in \Omega : \Big| \sum_{i=1}^n f(X_i(\omega)) - \int_{\mathcal{X}} f(x) P(dx) \Big| > \varepsilon \Big\} \leq \alpha'. $$ Notice that you're still taking the probability of a collection of outcomes in $\Omega$, so the measure $\mathbb{P}$ does not change when $f$ is applied; only the event does. That is why no transformed measure $P_f$ is needed.
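The same simulation illustrates this point for a composed function. Below (an illustrative sketch, not part of the original answer) $X_1 \sim U[0,1]$ and $f(x) = x^2$, so $f(X_i) \in [0,1]$ and $\mathbb{E}[f(X_1)] = 1/3$; the same averaged Hoeffding bound $2e^{-2n\varepsilon^2}$ applies, computed over exactly the same kind of draws of $\omega$:

```python
import math
import random

random.seed(1)

n = 100
trials = 20000
eps = 0.1

f = lambda x: x * x      # f: [0,1] -> [0,1], so Hoeffding still applies
mean_f = 1.0 / 3.0       # E[f(X_1)] = integral of x^2 over [0,1]

exceed = 0
for _ in range(trials):
    # One outcome omega yields the whole sample X_1(omega), ..., X_n(omega);
    # applying f changes the summands, not the randomness being measured.
    xs = [random.random() for _ in range(n)]
    if abs(sum(f(x) for x in xs) / n - mean_f) >= eps:
        exceed += 1

empirical_f = exceed / trials
bound = 2 * math.exp(-2 * n * eps**2)   # same form, since f(X_i) in [0,1]
```

No re-weighting of the sampling mechanism is needed: the generator that produces the $X_i$ is untouched, which mirrors the fact that the event still lives in $\Omega$ under $\mathbb{P}$.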
