Solved – Empirical Risk Minimization: empirical vs expected and true vs surrogate

Tags: machine-learning, risk

In Tie-Yan Liu's book, he says that statistical learning theory for empirical risk minimization involves four risk functions:

We also need to define the true loss of the learning problem, which serves as a reference to study the properties of different surrogate loss functions used by various learning algorithms. Usually the average surrogate loss on the training data is called the empirical surrogate risk (denoted as $\hat{R}_\phi$), the expected surrogate loss on the entire product space of input and output is called the expected surrogate risk (denoted as $R_\phi$), the average true loss on the training data is called the empirical true risk (denoted as $\hat{R}_0$), and the expected true loss on the entire product space of input and output is called the expected true risk (denoted as $R_0$).

I'm not very confident that I fully understand the concepts referred to by empirical vs expected and true vs surrogate.

It seems that the empirical risk is what we can measure on our training set, which is only a small sample from the input/output space. But what about true vs surrogate risk?

Best Answer

The answer lies in the distinction between your statistical problem and the algorithm you use to solve it. Let's focus on the expected risks: you can think of an expected risk as the risk you would incur with an infinite dataset.
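To make the "infinite dataset" intuition concrete, here is a minimal sketch (my own, not from the book or from any particular algorithm): for a fixed classifier, the empirical risk computed on a finite sample approaches the expected risk, which is estimated here by Monte Carlo on a very large sample. The distribution and the classifier are arbitrary choices for illustration.

```python
# Minimal sketch: empirical risk on a finite sample vs. a Monte Carlo proxy
# for the expected risk. Distribution and classifier are arbitrary choices.
import numpy as np

rng = np.random.default_rng(0)

def sample(n):
    # Toy distribution over (x, y) with y in {-1, +1}, noisily linked to x
    x = rng.normal(size=n)
    y = np.where(rng.random(n) < 1 / (1 + np.exp(-3 * x)), 1, -1)
    return x, y

def risk(x, y):
    # Average 0-1 loss of the fixed classifier sign(x)
    return (np.sign(x) != y).mean()

x_big, y_big = sample(1_000_000)
print("expected risk (Monte Carlo estimate):", risk(x_big, y_big))
for n in (10, 100, 10_000):
    x, y = sample(n)
    print(f"empirical risk, n={n}:", risk(x, y))
```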

The expected true risk is defined by two elements:

  1. The distribution of your points (in the simplest case, your points are some $(x, y) \in \mathbb{R}^d \times \mathbb{R}$, but in the learning-to-rank problem it can be more complex).

  2. The loss function. If you have a regression problem, it is likely to be the least-squares loss; if you do classification, the proportion of misclassified points (the 0-1 loss); and if you learn to rank, it is probably NDCG or MAP.

The loss function depends on the real problem you want to solve; it should not depend on how hard it is to optimize. For this reason, you may end up with a loss that is very difficult to optimize: for example, the proportion of misclassified points in classification.

If the loss is difficult to optimize, your algorithm will probably optimize another, simpler loss instead. For simple examples, have a look at http://fa.bianp.net/blog/2014/surrogate-loss-functions-in-machine-learning/
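As a small illustration of the four quantities from your quote (my own sketch, with synthetic data and arbitrary step size, not taken from the book): here the true loss is the 0-1 loss and the surrogate is the logistic loss. We run gradient descent on the empirical surrogate risk, because the 0-1 loss has zero gradient almost everywhere, and only check the empirical true risk afterwards.

```python
# Sketch: minimize the empirical *surrogate* risk (logistic loss) by gradient
# descent, then report the empirical *true* risk (0-1 loss). Synthetic data.
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = np.where(X @ np.array([1.5, -1.0]) + 0.3 * rng.normal(size=200) > 0, 1, -1)

w = np.zeros(2)
for _ in range(500):
    margins = y * (X @ w)
    # gradient of the mean logistic loss log(1 + exp(-margin)) w.r.t. w
    grad = -(X * (y / (1 + np.exp(margins)))[:, None]).mean(axis=0)
    w -= 0.5 * grad

print("empirical surrogate risk:", np.log1p(np.exp(-y * (X @ w))).mean())
print("empirical true (0-1) risk:", (np.sign(X @ w) != y).mean())
```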

Since you cite Tie-Yan Liu, let's also consider the following example: you want to rank some documents so as to optimize the NDCG measure. Your expected true risk is then: $$ \max\limits_{f} \int\limits_{Q \times D^m \times \mathbb{R}^m} NDCG(f(q,s), y) ~ d\rho(q, s, y)$$ where $q$ is a query and $s$ a set of $m$ documents.
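To make this true loss concrete, here is a small sketch (mine, using the standard DCG formula) that evaluates NDCG for one query: the documents are ranked by predicted scores $f(q, s_i)$ and the resulting gain is compared to the ideal ordering of the relevance labels $y$.

```python
# Sketch: NDCG of the ranking induced by predicted scores, for a single query.
import numpy as np

def dcg(rels):
    positions = np.arange(1, len(rels) + 1)
    return np.sum((2.0 ** rels - 1) / np.log2(positions + 1))

def ndcg(scores, y):
    order = np.argsort(-scores)      # ranking induced by the scores
    ideal = np.sort(y)[::-1]         # best possible ordering of the labels
    return dcg(y[order]) / dcg(ideal)

y = np.array([3, 2, 0, 1])                # graded relevance of m = 4 documents
scores = np.array([0.2, 0.9, 0.1, 0.4])   # predicted scores f(q, s_i)
print("NDCG:", ndcg(scores, y))
```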

There is no obvious way to optimize NDCG directly. Something you may want to do instead is predict the score of each document, i.e., solve: $$ \min\limits_{g} \int\limits_{Q \times D^m \times \mathbb{R}^m} \sum\limits_{i=1}^{m} (g(q, s_i) - y_i)^2 ~ d\rho(q, s, y) $$ Once you have predicted all the scores, you sort the documents by decreasing score. This is your surrogate loss: a loss that you minimize because you know how to do it. What you hope is that you have chosen a good surrogate loss, in the sense that if you manage to make it small, your true loss will also be small.
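A hedged sketch of that pointwise surrogate, with synthetic features and a linear model $g$ (my own choices, not from the book): minimize the squared error against the relevance labels with least squares, then rank the documents of the query by decreasing predicted score.

```python
# Sketch: minimize the squared-error surrogate with least squares, then rank
# the documents of one query by decreasing predicted score. Synthetic data.
import numpy as np

rng = np.random.default_rng(2)
n_docs, n_feat = 50, 5
X = rng.normal(size=(n_docs, n_feat))                            # features of (q, s_i)
y = X @ rng.normal(size=n_feat) + 0.1 * rng.normal(size=n_docs)  # relevance labels y_i

# Empirical surrogate risk: sum_i (g(q, s_i) - y_i)^2 with g linear
w, *_ = np.linalg.lstsq(X, y, rcond=None)

ranking = np.argsort(-(X @ w))   # final step: sort documents by decreasing score
print("documents in predicted relevance order:", ranking)
```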

A final note: for learning to rank with the NDCG measure, do not choose the surrogate loss I wrote above. Cossock and Zhang (2011) proved that minimizing it does not necessarily imply a good NDCG.
