Machine Learning – Example of Hoeffding’s Inequality Failure for Empirical Risk Minimization

Tags: high-dimensional, machine learning, probability, supervised learning

I am studying Introduction to Statistical Learning Theory by Bousquet, Boucheron and Lugosi. On pages 183 through 185 the authors consider the applicability of Hoeffding's Inequality to Empirical Risk Minimization (ERM). The setting is as follows.

Let $Z_1,…,Z_n$ be $n$ i.i.d. samples drawn from a distribution. Let $\mathcal{F}$ be a class of functions. For every $f\in\mathcal{F}$, define its risk $R(f)$ and empirical risk $R_n(f)$ by
$$
R(f)=\mathbb{E}f(Z)
\quad\text{and}\quad
R_n(f)=\frac{1}{n}\sum_{i=1}^nf(Z_i).
$$

Let $f^*\in\mathcal{F}$ be the minimizer of $R(f)$ over all $f\in\mathcal{F}$, and $f_n\in\mathcal{F}$ the minimizer of $R_n(f)$ over all $f\in\mathcal{F}$.

For simplicity assume that $f(z)\in[0,1]$ for all $z$. Hoeffding's Inequality says that for all $\delta\in(0,1]$, for a fixed $f\in\mathcal{F}$,
$$
\mathbb{P}\left(\left|R(f)-R_n(f)\right|\le\sqrt{\frac{\log(2/\delta)}{2n}}\right)\ge 1-\delta.
$$
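(Not from the book; a quick Monte Carlo sketch with function names of my own choosing, checking the fixed-$f$ guarantee numerically. I take $f(z)=z$ with $Z\sim\mathrm{Uniform}[0,1]$, so $R(f)=1/2$, and count how often the deviation stays within the bound across repeated samples.)

```python
import math
import random

def hoeffding_bound(n, delta):
    """Deviation bound sqrt(log(2/delta) / (2n)) for [0,1]-valued f."""
    return math.sqrt(math.log(2.0 / delta) / (2.0 * n))

def coverage_for_fixed_f(n=100, delta=0.05, trials=2000, seed=0):
    """Fraction of trials where |R(f) - R_n(f)| <= hoeffding_bound(n, delta),
    for the FIXED function f(z) = z with Z ~ Uniform[0, 1], so R(f) = 1/2."""
    rng = random.Random(seed)
    eps = hoeffding_bound(n, delta)
    hits = 0
    for _ in range(trials):
        r_n = sum(rng.random() for _ in range(n)) / n  # empirical risk R_n(f)
        if abs(0.5 - r_n) <= eps:
            hits += 1
    return hits / trials

print(coverage_for_fixed_f())  # at least 1 - delta = 0.95, as the inequality promises
```

For a fixed $f$ the observed coverage is comfortably above $1-\delta$ (Hoeffding is not tight here), which is exactly what fails once $f$ is replaced by the data-dependent $f_n$.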

However, this result does not say anything about $|R(f_n)-R_n(f_n)|$ because $f_n$ is random.

My question: Is there any concrete example showing that $|R(f_n)-R_n(f_n)|$ does not obey the bound $\sqrt{\log(2/\delta)/(2n)}$ with probability $1-\delta$?

Best Answer

The problem isn't so much finding an example -- the result will typically fail for some $n$ and $\delta$ when ${\cal F}$ has more than one entry -- as doing the calculations when most of the available tools are designed to give upper bounds.

Here's a ridiculously extreme version: assume $Z$ has an atomless (say, continuous) distribution on $\mathbb R$. For every finite set of points $S\subset\mathbb R$ define a function $f_S(z)$ that is 0 if $z\in S$ and 1 otherwise, and let $\cal F$ be the set of all such functions. Then $f_n=f_S$ with $S=\{Z_1,\dots,Z_n\}$ is an empirical risk minimizer: it is 0 at every observed point, giving $R_n(f_n)=0$ a.s.

On the other hand, since all finite sets have measure zero, $R(f_S)=\mathbb{P}(Z\notin S)=1$ for every $f_S\in\cal F$; in particular $R(f_n)=1$ a.s. So $|R(f_n)-R_n(f_n)|=1$ a.s., which exceeds the bound $\sqrt{\log(2/\delta)/(2n)}$ whenever $\delta>2e^{-2n}$.
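(My own illustrative sketch of this construction, not code from the answer: the ERM memorizes the sample, so its empirical risk is 0, while its true risk, estimated on fresh draws, is 1. With continuous uniform draws, a fresh point lands in the memorized finite set with probability zero.)

```python
import random

def simulate_memorizer(n=50, seed=1):
    """ERM over the class {f_S}: f_S(z) = 0 if z in S, else 1.
    With Z ~ Uniform[0, 1], the empirical minimizer memorizes the sample."""
    rng = random.Random(seed)
    sample = [rng.random() for _ in range(n)]
    S = set(sample)  # f_n = f_S with S = {Z_1, ..., Z_n}
    f_n = lambda z: 0.0 if z in S else 1.0
    empirical_risk = sum(f_n(z) for z in sample) / n      # = 0 by construction
    fresh = [rng.random() for _ in range(10_000)]         # fresh draws to estimate R(f_n)
    true_risk = sum(f_n(z) for z in fresh) / len(fresh)   # ≈ 1: fresh points miss S a.s.
    return empirical_risk, true_risk

print(simulate_memorizer())  # → (0.0, 1.0)
```

The gap $|R(f_n)-R_n(f_n)|$ is 1 regardless of $n$, so no bound shrinking like $1/\sqrt n$ can hold for this class.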

This does illustrate what causes the problem: overfitting. When ${\cal F}$ has multiple competing $f$s, the one that gets chosen as $f_n$ will typically have $R_n(f_n)<R(f_n)$, because that's how it got chosen.
