Machine Learning – Explanation of ERM Rule from Shalev-Shwartz and Ben-David

machine-learning, pac-learning

I am reading the book "Understanding Machine Learning" by Shalev-Shwartz and Ben-David.
On page 48, theorem 6.7 (The fundamental theorem of statistical learning, FTSL) says: Let $H$ be a hypothesis class… the following properties are equivalent… 4. $H$ is PAC learnable. 5. Any ERM rule is a successful PAC learner for $H$.

I could not find the definition of an ERM rule in this book or anywhere else. What is it?

It cannot be any algorithm that minimizes the empirical risk, can it?

What conditions must the algorithm satisfy in order to be a successful PAC learner?

Best Answer

In section 6.4 of Understanding Machine Learning by Shai Shalev-Shwartz and Shai Ben-David, it is written

Theorem 6.7 (The Fundamental Theorem of Statistical Learning) Let $\mathcal{H}$ be a hypothesis class of functions from a domain $\mathcal{X}$ to $\{0,1\}$ and let the loss function be the $0 - 1$ loss. Then, the following are equivalent:

  1. $\mathcal{H}$ has the uniform convergence property.
  2. Any ERM rule is a successful agnostic PAC learner for $\mathcal{H}$.
  3. $\mathcal{H}$ is agnostic PAC learnable.
  4. $\mathcal{H}$ is PAC learnable.
  5. Any ERM rule is a successful PAC learner for $\mathcal{H}$.
  6. $\mathcal{H}$ has a finite VC-dimension.

When the authors write "the following are equivalent", what they mean is that each of the six statements implies all of the others. One way to establish this is to prove a cycle of implications: $$ \text{statement 1} \implies \text{statement 2} \implies \text{statement 3} \implies \text{statement 4} \implies \text{statement 5} \implies \text{statement 6} \implies \text{statement 1} $$
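For example, to see that statement 4 implies statement 2, one simply chases this cycle: $$ \text{statement 4} \implies \text{statement 5} \implies \text{statement 6} \implies \text{statement 1} \implies \text{statement 2} $$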

So, if statement 1 is true, then statement 2 is true, and if statement 2 is true, then statement 3 is true, and so on, until we form a loop back to statement 1. After listing these statements, the authors then write

The proof of the theorem is given in the next section.

In section 6.5, it is written

We have already seen that $1 \implies 2$ in Chapter 4. The implications $2 \implies 3$ and $3 \implies 4$ are trivial and so is $2 \implies 5$. The implications $4 \implies 6$ and $5 \implies 6$ follow from the No-Free-Lunch theorem.


Response to first comment

You only write what is obvious. So, what is an ERM rule? What is the definition of an ERM rule?

Unfortunately, it seems that the wording of Theorem 6.7 is a bit vague. In section 8.2, the authors write

Given a hypothesis class, $\mathcal{H}$, a domain set $Z$, and a loss function $\ell$, the corresponding $\text{ERM}_{\mathcal{H}}$ rule can be defined as follows:

On a finite input sample $S \in Z^m$, output some $h \in \mathcal{H}$ that minimizes the empirical loss $$ L_S(h) = \frac{1}{|S|} \sum_{z \in S} \ell (h,z) $$

Therefore, every ERM rule depends on four choices:

  • the hypothesis class $\mathcal{H}$,
  • the domain set $Z$,
  • the finite input sample $S \in Z^m$ (or "training dataset"), and
  • the loss function $\ell$.
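To make this concrete, here is a minimal Python sketch of an $\text{ERM}_{\mathcal{H}}$ rule for a finite hypothesis class under the $0-1$ loss. The threshold hypotheses and the toy sample are my own illustrative assumptions, not anything taken from the book.

```python
def zero_one_loss(h, z):
    """0-1 loss: 1 if the hypothesis h misclassifies the example z = (x, y), else 0."""
    x, y = z
    return 0 if h(x) == y else 1

def empirical_risk(h, S, loss):
    """L_S(h) = (1/|S|) * sum over z in S of loss(h, z)."""
    return sum(loss(h, z) for z in S) / len(S)

def erm_rule(H, S, loss):
    """Return some h in H that minimizes the empirical risk on the sample S."""
    return min(H, key=lambda h: empirical_risk(h, S, loss))

# Toy example: threshold classifiers on the real line.
H = [lambda x, t=t: 1 if x >= t else 0 for t in (0.0, 0.5, 1.0)]
S = [(0.2, 0), (0.7, 1), (0.9, 1)]              # finite input sample of (x, y) pairs
h_hat = erm_rule(H, S, zero_one_loss)
print(empirical_risk(h_hat, S, zero_one_loss))  # 0.0 (the threshold at 0.5 fits S)
```

If several hypotheses attain the minimal empirical risk, `min` simply returns the first of them; any such tie-breaking still satisfies the definition quoted above.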

In theorem 6.7, it is written

...Let $\mathcal{H}$ be a hypothesis class of functions from a domain $\mathcal{X}$ to $\{0,1\}$ and let the loss function be the $0 - 1$ loss...

which means that the choices of the hypothesis class $\mathcal{H}$, the domain set $Z$, and the loss function $\ell$ have already been made. This leaves the choice of the finite input sample (training dataset) $S \in Z^m$. This suggests that when the authors write in Theorem 6.7

Any ERM rule...

What they mean is

For any finite input sample (training dataset) $S \in Z^m$, the ERM rule...

Response to second comment

For example: an algorithm picks 2 random functions from the class $H$ and selects the one with the lower empirical risk. This is an ERM algorithm. Will this algorithm be a successful PAC learner for $H$ when $H$ is PAC learnable?

Yes, assuming that you have made your choice of the finite input sample (training dataset) $S \in Z^m$, this is just $$ \text{statement 4} \implies \text{statement 5} $$ in theorem 6.7.
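For reference, here is a sketch of the procedure described in the comment, reusing the same kind of toy setup as in the sketch above (again, the hypotheses and the sample are illustrative assumptions):

```python
import random

def zero_one_loss(h, z):
    x, y = z
    return 0 if h(x) == y else 1

def empirical_risk(h, S):
    return sum(zero_one_loss(h, z) for z in S) / len(S)

def pick_two_rule(H, S):
    """Draw two hypotheses from H at random and return the one with the
    lower empirical risk on S, as described in the comment."""
    h1, h2 = random.sample(H, 2)
    return min((h1, h2), key=lambda h: empirical_risk(h, S))

H = [lambda x, t=t: 1 if x >= t else 0 for t in (0.0, 0.5, 1.0)]
S = [(0.2, 0), (0.7, 1), (0.9, 1)]
h = pick_two_rule(H, S)
print(empirical_risk(h, S))
```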

Response to third comment

Section 4.2 has only two statements. One is Hoeffding's inequality; the other is about a finite class of functions. The latter states, in particular, that "the class is agnostically PAC learnable using the ERM algorithm". But it does not specify anything about "the ERM algorithm", and there is no proof that it will work with any ERM algorithm.

I think section 4.2 complements section 4.1, so see section 4.1 as well. Note that in my original answer, I wrote

Section 4.2 seems to be what the authors are referring to here.

which means that I wasn't sure if section 4.2 is what they were referring to. Regardless, I've removed this statement from my original answer.

Response to fourth comment

If the ERM rule is a rule which finds a hypothesis with minimum empirical risk in the class $H$, then such an algorithm may not even exist for infinite classes. What does the FTSL say then? If there is no algorithm that finds the absolute minimum of the empirical risk every time, is the class of functions then not PAC learnable?

I am not sure what you mean by "infinite classes" here. If you mean that $\mathcal{H}$ is infinite in size, then we may not be interested in infinitely sized hypothesis classes anyway. In section 2.3, the authors write

We have just demonstrated that the ERM rule might lead to overfitting. Rather than giving up on the ERM paradigm, we will look for ways to rectify it. We will search for conditions under which there is a guarantee that ERM does not overfit, namely, conditions under which when the ERM predictor has good performance with respect to the training data, it is also highly likely to perform well over the underlying data distribution.

and in section 2.3.1, the authors write

The simplest type of restriction on a class is imposing an upper bound on its size (that is, the number of predictors $h$ in $\mathcal{H}$). In this section, we show that if $\mathcal{H}$ is a finite class then $\text{ERM}_{\mathcal{H}}$ will not overfit, provided it is based on a sufficiently large training sample (this size requirement will depend on the size of $\mathcal{H}$).
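To give one concrete guarantee of this kind: for a finite $\mathcal{H}$ with the $0-1$ loss, under the realizability assumption, the book shows (Corollary 2.3, if I recall the numbering correctly) that a training sample of size $$ m \geq \frac{\log\left(|\mathcal{H}|/\delta\right)}{\epsilon} $$ suffices for every ERM hypothesis $h_S$ to have true risk at most $\epsilon$ with probability at least $1 - \delta$ over the choice of $S$.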
