Machine Learning – Why Hoeffding’s Inequality Can’t Be Used as the Final Bound if the Learnt Hypothesis is Part of $\mathcal{H}$

machine-learning, probability

Similar questions have been answered elsewhere, but I still do not understand those answers:

In the notes here, we see the definition of PAC Learning:

Definition 3 (Generalization Gap). Given a sample set $S=\left(x_i, y_i\right), i \in\{1, \ldots, m\}$, drawn i.i.d. from $\mathcal{D}$, a hypothesis $h_S$ learnt on $S$, and a specific definition of loss $\ell$, the generalization gap is defined as
$$
\epsilon_{\text {gen }}\left(h_S\right)=\left|R\left[h_S\right]-\hat{R}_S\left[h_S\right]\right|
$$

Definition 9 (PAC Learning). A hypothesis class $\mathcal{H}$ is (Agnostic) PAC Learnable if given some arbitrary $\epsilon, \delta>0$ there exists an $m_{\mathcal{H}}(\epsilon, \delta)$ such that for any $S$ with $|S|>m_{\mathcal{H}}(\epsilon, \delta)$ we have $\epsilon_{g e n}\left(h_S\right) \leq \epsilon$ with probability at least $1-\delta$. The "probably" (P) part of PAC corresponds to $1-\delta$, while the "approximately correct" (AC) part corresponds to $\epsilon$. $m_{\mathcal{H}}(\epsilon, \delta)$ is known as the Sample Complexity of the hypothesis class.

We will now show that any finite hypothesis class $\mathcal{H}$ is agnostic PAC learnable. We first derive a probabilistic bound on the following distance that holds for any $h \in \mathcal{H}$ :
$$
\left|R[h]-\hat{R}_S[h]\right|
$$

We notice that this is just the absolute distance between the empirical average $\hat{R}_S[h]$ and its mean since:
$$
\mathbb{E}\left[\hat{R}_S[h]\right]=\mathbb{E}\left[\frac{1}{m} \sum_{i=1}^m \ell\left(h\left(x_i\right), y_i\right)\right]=\frac{1}{m} \sum_{i=1}^m \mathbb{E}\left[\ell\left(h\left(x_i\right), y_i\right)\right]=R[h]
$$

We can use Hoeffding's inequality to decide how many samples we would need to take to guarantee that
$$
P\left(\left|\hat{R}_S[h]-R[h]\right|<\epsilon\right)>1-\delta
$$

If we set $\delta=\exp \left(-2 m \epsilon^2\right)$ and solve for $m$, we get $m=O\!\left(\frac{\log (1 / \delta)}{\epsilon^2}\right)$. Note: this is a lower bound on the sample size that guarantees the statement above (see Sample Complexity in the PAC Learning definition above).
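
As a quick sanity check of this algebra, here is a minimal sketch of my own (the function name is made up, not from the notes) that computes the smallest $m$ satisfying the notes' convention $\delta=\exp(-2m\epsilon^2)$:

```python
import math

def hoeffding_sample_size(epsilon: float, delta: float) -> int:
    """Smallest m with exp(-2*m*epsilon**2) <= delta, i.e. m >= log(1/delta) / (2*epsilon**2).

    Follows the notes' convention delta = exp(-2*m*epsilon**2) for a single fixed
    hypothesis; with the two-sided bound 2*exp(-2*m*epsilon**2), replace 1/delta by 2/delta.
    """
    return math.ceil(math.log(1.0 / delta) / (2.0 * epsilon ** 2))

# Example: a gap below 0.05 with probability at least 0.95 needs m >= 600 samples.
print(hoeffding_sample_size(epsilon=0.05, delta=0.05))  # 600
```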

Now, suppose we consider an arbitrary $h \in \mathcal{H}$ and a loss function $\ell$ with range $[0,1]$. For the random variable $\hat{R}_S[h]$ we can use Hoeffding's inequality to get:
$$
P\left(\left|\hat{R}_S[h]-R[h]\right| \geq \epsilon\right) \leq 2 \exp \left(-2 m \epsilon^2\right)
$$
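
To make this concrete, the bound for a fixed $h$ can be checked by simulation. The sketch below is my own illustration with a synthetic 0/1 loss (so $R[h]$ is just a chosen error probability); it estimates how often $|\hat{R}_S[h]-R[h]| \geq \epsilon$ across many sample sets and compares that frequency with $2\exp(-2m\epsilon^2)$:

```python
import numpy as np

rng = np.random.default_rng(0)

m, epsilon, trials = 200, 0.1, 20_000
true_risk = 0.3  # R[h]: error probability of the fixed hypothesis h on a fresh point

# Row t holds the 0/1 losses of h on the t-th i.i.d. sample S of size m.
losses = rng.random((trials, m)) < true_risk
empirical_risks = losses.mean(axis=1)  # \hat{R}_S[h] for each sample S

frequency = np.mean(np.abs(empirical_risks - true_risk) >= epsilon)
bound = 2 * np.exp(-2 * m * epsilon**2)
print(f"observed P(|gap| >= eps) ~= {frequency:.4f} <= Hoeffding bound {bound:.4f}")
```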

Note: We cannot simply replace $h$ with $h_S$ in this bound, because $h_S$ depends on the whole sample, so the losses on the data points are no longer independent and Hoeffding's inequality does not apply.

Subsequently, to bound the generalization gap, a first attempt is to use the union bound:

So to extend this bound to $\epsilon_{\text {gen }}\left(h_S\right)$ we will use the union bound:
$$
\begin{aligned}
P\left(\left|\hat{R}_S\left[h_S\right]-R\left[h_S\right]\right| \geq \epsilon\right) & \leq P\left(\max _{h \in \mathcal{H}}\left|\hat{R}_S[h]-R[h]\right| \geq \epsilon\right) \\
& =P\left(\bigcup_{h \in \mathcal{H}}\left\{\left|\hat{R}_S[h]-R[h]\right| \geq \epsilon\right\}\right) \\
& \stackrel{(a)}{\leq} \sum_{h \in \mathcal{H}} P\left(\left|\hat{R}_S[h]-R[h]\right| \geq \epsilon\right) \\
& =2|\mathcal{H}| \exp \left(-2 m \epsilon^2\right)
\end{aligned}
$$

Here (a) follows from the union bound, which can be proved for two events and extended by induction: $P(A \cup B)=P(A)+P(B)-P(A \cap B) \leq P(A)+P(B)$.
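
For completeness (this step is implied but not written out above), setting the right-hand side equal to $\delta$ and solving for $m$ gives the sample complexity of a finite class:
$$
2|\mathcal{H}| \exp \left(-2 m \epsilon^2\right) \leq \delta
\quad \Longleftrightarrow \quad
m \geq \frac{\log \left(2|\mathcal{H}| / \delta\right)}{2 \epsilon^2},
$$
so $m_{\mathcal{H}}(\epsilon, \delta)=O\!\left(\frac{\log |\mathcal{H}|+\log (1 / \delta)}{\epsilon^2}\right)$, which is finite for any finite $\mathcal{H}$, i.e. $\mathcal{H}$ is agnostic PAC learnable.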


My question: I understand that Hoeffding's inequality breaks if you replace $h$ by $h_S$, because the i.i.d. assumption no longer holds. But then what stops us from plugging $h_S$ into Hoeffding's inequality anyway, given that $h_S$ is one of the hypotheses in the set $\mathcal{H}$? Does the equation below mean/suggest that, amongst all the $h \in \mathcal{H}$, there is a "worst" $h$?

$$
\max _{h \in \mathcal{H}}\left|\hat{R}_S[h]-R[h]\right|
$$


EDIT: Here's what I thought:

  1. Hoeffding's inequality does not hold for $h_S$ after learning.

  2. This means we need some other way to bound the gap for $h_S$, since the inequality from point 1 does not apply.

  3. But we do know that $\epsilon_{gen}(h_S)$ is at most $\epsilon_{gen}(h_{worst})$, where $h_{worst}$ is the worst $h$ (the one with the largest generalization gap).

  4. The event that $h_{worst}$ has a gap larger than $\epsilon$ can be expressed as a union of events over all $h \in \mathcal{H}$.

  5. So the independence assumption is not violated, because each event in the union involves a fixed $h$ that does not depend on $S$ (see the display just below).
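
In symbols, the containment I have in mind in steps 3 and 4 is
$$
\left\{\left|\hat{R}_S\left[h_S\right]-R\left[h_S\right]\right| \geq \epsilon\right\}
\subseteq
\left\{\max _{h \in \mathcal{H}}\left|\hat{R}_S[h]-R[h]\right| \geq \epsilon\right\}
=\bigcup_{h \in \mathcal{H}}\left\{\left|\hat{R}_S[h]-R[h]\right| \geq \epsilon\right\},
$$
which holds because $h_S \in \mathcal{H}$ for every sample $S$, and each event in the union involves one fixed $h$ to which Hoeffding's inequality does apply.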

Best Answer

Does the equation below mean/suggest that amongst all the $h \in \mathcal{H}$, there is a "worst" $h$?

$$ \max _{h \in \mathcal{H}}\left|\hat{R}_S[h]-R[h]\right| $$

The gap differs for different hypotheses.

In particular, a model that is always wrong will have $\hat{R}_S\left[h\right]-R\left[h\right] = 0$, because its empirical risk and its true risk both equal the maximum loss.

My question: I understand that Hoeffding's inequality breaks if you replace $h$ by $h_S$, because the i.i.d. assumption no longer holds. But then what stops us from plugging $h_S$ into Hoeffding's inequality anyway, given that $h_S$ is one of the hypotheses in the set $\mathcal{H}$?

Hoeffding's inequality applies to the situation where there is a single, fixed hypothesis $h$. With a learning algorithm, however, you have multiple hypotheses $h_1, h_2, \dots, h_n$, out of which you select the one with the lowest empirical risk $\hat{R}_S(h_i)$.

The $\hat{R}_S(h_i)$ might (and probably will) correlate with the gap $\epsilon_{gen}(h_i)$. For instance, overfitting makes us select a hypothesis whose empirical risk $\hat{R}_S(h_i)$ is low precisely because it underestimates the true risk. Selecting for the lowest empirical risk therefore tends to make the gap larger than it would be without the selection.

That is why you cannot simply plug $h_S$ into Hoeffding's inequality. The gap of $h_S$ can be larger than what the formula predicts for a single, fixed hypothesis (as opposed to one selected because its $\hat{R}_S$ was optimal).
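
To see this selection effect concretely, here is a small simulation sketch of my own (synthetic 0/1 losses, all hypotheses given the same true risk; the variable names are made up): the hypothesis picked for its lowest empirical risk shows a much larger gap than any single fixed hypothesis.

```python
import numpy as np

rng = np.random.default_rng(1)

m, n_hypotheses, trials = 50, 1000, 500
true_risk = 0.5  # every h in H is assumed to have the same true risk R[h]

fixed_gaps, selected_gaps = [], []
for _ in range(trials):
    # 0/1 losses of each hypothesis in H on one i.i.d. sample S of size m
    losses = rng.random((n_hypotheses, m)) < true_risk
    emp_risks = losses.mean(axis=1)  # \hat{R}_S[h] for every h in H

    fixed_gaps.append(abs(emp_risks[0] - true_risk))       # gap of one fixed h
    h_S = emp_risks.argmin()                                # ERM: pick the lowest empirical risk
    selected_gaps.append(abs(emp_risks[h_S] - true_risk))   # gap of the selected h_S

print(f"mean gap of a fixed h   : {np.mean(fixed_gaps):.3f}")
print(f"mean gap of selected h_S: {np.mean(selected_gaps):.3f}")  # systematically larger
```

In this toy setup the gap of the selected $h_S$ is systematically larger than that of any fixed hypothesis; this multiple-comparisons effect is exactly what the union bound over all of $\mathcal{H}$ accounts for.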
