Solved – What does improper learning mean in the context of statistical learning theory and machine learning

machine-learning, mathematical-statistics

I was reading the following paper and it talked about improper learning. I wasn't 100% sure what it rigorously meant, but they do mention:

[Image: excerpt from the paper describing improper ("representation independent") learning.]

I am not sure what "representation independent" means, but as far as I understand, the learning algorithm is allowed to choose functions that are not necessarily within the concept class $C$ that it is trying to learn. That is, it might be trying to learn a specific type of function from samples (say, a half-space from the class $\mathcal{H}_{n,k}$), but we allow it to output functions not of that type, with the restriction that their error is not very far from that of the best function in the original class under consideration.

I guess the intuition is more or less clear, but I was wondering whether this interpretation is correct and whether there is a more precise/rigorous way to define what improper learning is.

Best Answer

In statistical learning theory, the standard batch learning problem is defined in terms of a distribution $P$ over some space $\mathcal{Z}$, belonging to some set of distributions $\mathcal{P}$, a hypothesis class $\mathcal{H}$, and a loss function $\ell$, which assigns a (say) nonnegative real ("a loss") to each pair $(P,h)$ of a distribution and a hypothesis (i.e., $\ell: \mathcal{P} \times \mathcal{H} \to [0,\infty)$). One is then given a sequence of $n$ points $D_n = (Z_1,\dots,Z_n)\in \mathcal{Z}^n$, sampled i.i.d. from $P$, and the job of the learning algorithm is to come up with a hypothesis $h_n\in \mathcal{H}$ based on $D_n$ that achieves a small (say) expected loss $\mathbb{E}[\ell(P,h_n)]$. (The random quantity in this expression is $h_n$: $h_n$ depends on the data $D_n$, which is random, hence $h_n$ is also random.)
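For concreteness, here is a minimal instantiation of this abstract setup (the specific choices of $\mathcal{Z}$, $\mathcal{H}$, and $\ell$ below are mine, purely for illustration): binary classification under the zero-one loss,
$$\mathcal{Z} = \mathcal{X}\times\{0,1\},\qquad \mathcal{H}\subseteq\{h:\mathcal{X}\to\{0,1\}\},\qquad \ell(P,h)=\mathbb{P}_{(X,Y)\sim P}\bigl(h(X)\neq Y\bigr),$$
so that $\ell(P,h)$ is the probability that $h$ misclassifies a fresh example drawn from $P$.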

In terms of the goal of learning, one criterion for evaluating the power of a learning algorithm is how fast the excess expected loss (or excess risk) $\mathbb{E}[\ell(P,h_n)]-\inf_{h\in \mathcal{H}} \ell(P,h)$ decreases as $n\to \infty$.
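As a standard example of such a rate (an illustration I am adding, under the extra assumption that the loss has the form $\ell(P,h)=\mathbb{E}_{Z\sim P}[\phi(h,Z)]$ for a pointwise loss $\phi$ bounded in $[0,1]$ and that $\mathcal{H}$ is finite), empirical risk minimization achieves
$$\mathbb{E}[\ell(P,h_n)]-\inf_{h\in\mathcal{H}}\ell(P,h) \;=\; O\!\left(\sqrt{\frac{\log|\mathcal{H}|}{n}}\right),$$
and for binary classification over infinite classes, $\log|\mathcal{H}|$ is replaced (up to constants and logarithmic factors) by the VC dimension of $\mathcal{H}$.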

Improper learning changes this criterion slightly: success is evaluated by $\mathbb{E}[ \ell(P,h_n) ] - \inf_{h\in \mathcal{H}_0 } \ell(P,h)$ for some $\mathcal{H}_0\subset \mathcal{H}$, while $h_n$ is still allowed to be any element of $\mathcal{H}$. Intuitively, when $\mathcal{H}_0$ is a proper subset of $\mathcal{H}$, competing with the best hypothesis from $\mathcal{H}_0$ should be easier.
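To make the definition concrete, here is a minimal numerical sketch (my own construction, not from the paper): $\mathcal{H}_0$ is the class of one-dimensional threshold classifiers, a proper learner does ERM over $\mathcal{H}_0$, and an improper learner outputs a $k$-nearest-neighbour rule, which is not a threshold function, yet both are scored against the best threshold in $\mathcal{H}_0$. The data-generating process, sample sizes, and $k$ are arbitrary choices for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

# Data: X uniform on [0, 1]; the label is a noisy threshold at 0.5.
def sample(n):
    x = rng.uniform(0.0, 1.0, n)
    y = (x > 0.5).astype(int)
    flip = rng.uniform(size=n) < 0.1              # 10% label noise
    return x, np.where(flip, 1 - y, y)

# Zero-one risk of a classifier f, estimated on a large held-out sample.
x_test, y_test = sample(20_000)
def risk(f):
    return np.mean(f(x_test) != y_test)

# H_0: threshold classifiers h_t(x) = 1{x > t}; best-in-class risk is the yardstick.
thresholds = np.linspace(0.0, 1.0, 201)
best_in_H0 = min(risk(lambda x, t=t: (x > t).astype(int)) for t in thresholds)

x_train, y_train = sample(200)

# Proper learner: empirical risk minimization over H_0 (its output IS a threshold).
def proper_fit(x, y):
    errs = [np.mean((x > t).astype(int) != y) for t in thresholds]
    t_hat = thresholds[int(np.argmin(errs))]
    return lambda z: (z > t_hat).astype(int)

# Improper learner: k-nearest-neighbour majority vote (its output is NOT in H_0),
# but its excess risk is still measured against the best threshold in H_0.
def improper_fit(x, y, k=15):
    def predict(z):
        d = np.abs(z[:, None] - x[None, :])             # pairwise distances
        nn = np.argsort(d, axis=1)[:, :k]               # k nearest training points
        return (y[nn].mean(axis=1) > 0.5).astype(int)   # majority vote
    return predict

for name, fit in [("proper (ERM over H_0)", proper_fit),
                  ("improper (k-NN)", improper_fit)]:
    h_n = fit(x_train, y_train)
    print(f"{name:22s}  excess risk vs best in H_0: {risk(h_n) - best_in_H0:+.4f}")
```

The printed numbers are only meant to show that the improper learner's output lives outside $\mathcal{H}_0$ while its excess risk is still computed relative to $\mathcal{H}_0$; no claim about which learner performs better is intended.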

Where is this coming from? Learning is all about guessing the right bias. The bias here is expressed in terms of $\mathcal{H}_0$: the designer of the algorithm makes a guess that there will be a hypothesis in $\mathcal{H}_0$ which achieves a small loss. Next, the problem is to design an algorithm. However, does it make sense to require the algorithm to output hypotheses from $\mathcal{H}_0$? Unless some specific circumstances require this, why would we impose that restriction? By allowing the learning algorithm to produce hypotheses in a larger class $\mathcal{H}_1$ that lies between $\mathcal{H}_0$ and $\mathcal{H}$, the algorithm designer's flexibility is increased, and hence a potentially lower excess risk over the best hypothesis in $\mathcal{H}_0$ can be achieved.

Why not then allow $\mathcal{H}_1 = \mathcal{H}$? The answer depends on how the learning algorithm uses $\mathcal{H}_1$. If it really uses only a (potentially small) subset of it, then it won't hurt to take $\mathcal{H}_1 = \mathcal{H}$. However, many learning algorithms are designed to use the full hypothesis space they are given, and they slow down (become more conservative) when used with a larger hypothesis class. With such algorithms it makes sense to use a proper subset of $\mathcal{H}$ as $\mathcal{H}_1$.
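A concrete instance of this chain (my own example, in the spirit of how improper learning of half-spaces is often discussed) would be
$$\mathcal{H}_0=\{\text{half-spaces } x\mapsto \operatorname{sign}(\langle w,x\rangle - b)\}\;\subset\;\mathcal{H}_1=\{\text{degree-}d\text{ polynomial threshold functions}\}\;\subset\;\mathcal{H}=\{\text{all (measurable) classifiers}\}.$$
An improper learner may output a degree-$d$ polynomial threshold function while its excess risk is measured against the best half-space. This is also one common reading of "representation independent": the learner's success is judged against $\mathcal{H}_0$, but its output need not be representable as an element of $\mathcal{H}_0$.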