Interpretation of $\mathcal{D}$ and difference between the accuracy parameter and training error


I am studying the book "Understanding Machine Learning: From Theory to Algorithms". They introduce the following things, which I'm not sure I'm interpreting correctly:

The training error (p. 35): $$L_S(h):=\frac{|\{i\in[m]:h(x_i)\neq y_i\}|}{m}$$
where $S=((x_1,y_1),\dots,(x_m,y_m))$ is the training data.

Then they define the error of a classifier (p. 34):

Formally, given a domain subset, $A\subset \mathcal{X}$, the probability distribution $\mathcal{D}$ assigns a number, $\mathcal{D}(A)$, which determines how likely it is to observe a point $x\in A$. We define the error of a prediction rule, $h:\mathcal{X}\to \mathcal{Y}$, to be $$L_{\mathcal{D},f}(h):=P_{x\sim\mathcal{D}}[h(x)\neq f(x)]:=\mathcal{D}(\{x:h(x)\neq f(x)\}).$$

Here they define $\mathcal{D}$ as the probability distribution over $\mathcal{X}$.

Question 1: How exactly is $\mathcal{D}$ supposed to be interpreted? I will give a trivial example; please tell me if I have interpreted it correctly. Let $\mathcal{X}=\{x_1,x_2,x_3\}$ and $\mathcal{Y}=\{0,1\}$. Let $f(x_i)=1$ if $x_i\geq 0$ and $0$ otherwise. Then let $\mathcal{D}$ be such that $P(x_1\geq 0)=1/2$, $P(x_2\geq 0)=1/3$, $P(x_3 \geq 0)=1/4$. Is this a reasonable example of how to interpret $\mathcal{D}$?

Then, on p. 38 (last sentence), they introduce the accuracy parameter $\epsilon$.
Question 2: What is essentially the difference between the accuracy parameter and the training error?

Best Answer

In my view, the assumption of a deterministic relationship between labels and inputs made in chapter 2 of that book can be interpreted, but it's a bit artificial because it's not probabilistic machine learning in the way we normally think about it. Your queries, in my view, are best dealt with not in isolation but in concert with the crux of what that book is about. Here's how the statistical learning framework normally proceeds. I will use the authors' informal papaya analogy first, then formalise it using the tools of chapter 3.

We can conceive of all possible papayas on an island and their tastiness. We are interested in predicting whether a papaya is tasty or not based on a fixed set of observed characteristics, such as softness, smell, shape, and so on. It's also the case that two papayas with exactly the same observed characteristics may ultimately differ in whether or not they are tasty.

Formally, we can set up the problem as follows. You have input features $X \in \mathcal{X}$ and labels $Y \in \mathcal{Y}$. For binary classification, if we are interested in $d$ papaya characteristics, we might let the input space be $\mathcal{X} = \mathbb{R}^d$ and the label space $\mathcal{Y} = \{0, 1\}$. We now assume that there is a joint probability distribution $\mathcal{D} = P(X, Y)$ over input features and labels; the fact that this is a probability distribution, rather than a deterministic rule, encodes the fact that papayas with the same observed characteristics may not all receive the same class label.

The goal is to estimate a predictor $h$, which is a function $h: \mathcal{X} \rightarrow \mathcal{Y}$. In order to do so, we need a criterion for choosing a predictor $h$ from some restricted class of predictors $\mathcal{H}$. Ideally, we want to select $h$ from $\mathcal{H}$ so as to minimise the true error/risk/generalisation error

$$L_{\mathcal{D}}(h) := \underset{(x, y) \sim \mathcal{D}}{\mathbb{P}} \left( h(x) \neq y \right).$$
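
To make this definition concrete, here is a minimal sketch in Python of computing $L_{\mathcal{D}}(h)$ exactly when $\mathcal{D}$ is a known distribution over a tiny discrete space. All probabilities, the feature names, and the predictor are invented for illustration; none of this comes from the book.

```python
# A toy joint distribution D = P(X, Y) over (softness, smell, tasty),
# each binary. The probabilities are invented for illustration only.
D = {
    (1, 1, 1): 0.30, (1, 1, 0): 0.05,
    (1, 0, 1): 0.10, (1, 0, 0): 0.15,
    (0, 1, 1): 0.05, (0, 1, 0): 0.10,
    (0, 0, 1): 0.05, (0, 0, 0): 0.20,
}  # probabilities sum to 1

def h(x):
    """Example predictor: declare a papaya tasty iff it is soft."""
    softness, smell = x
    return softness

# L_D(h) = P_{(x,y) ~ D}(h(x) != y): total mass of the disagreement event.
true_risk = sum(p for (x1, x2, y), p in D.items() if h((x1, x2)) != y)
print(true_risk)  # 0.05 + 0.15 + 0.05 + 0.05 = 0.30
```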

The problem, however, is that we do not know $\mathcal{D} = P(X, Y)$. In terms of the statistical angle on the authors' analogy, it might be infeasible to go around the entire island recording the observed characteristics and tastiness of every papaya. However, we do have indirect, limited access to $\mathcal{D}$ through a training sample $(x_1, y_1), \dots, (x_m, y_m) \sim \mathcal{D}$ consisting of $m$ papayas' observed characteristics and their tastiness labels. So we have to be content with selecting $h$ to minimise an empirical estimate of the true error, which we call the training error/empirical risk

$$L_S(h) := \frac{\vert \{ i \in [m] : h(x_i) \neq y_i \} \vert}{m} = \frac{1}{m} \sum^m_{i=1} \mathbb{I} (h(x_i) \neq y_i).$$
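
Continuing the toy example above (again, all numbers and hypothesis names are hypothetical), here is a sketch of computing $L_S(h)$ on a sample drawn from $\mathcal{D}$, and of selecting a predictor from a small finite $\mathcal{H}$ by minimising it:

```python
import random

# Same invented toy distribution as before, repeated so this runs standalone.
D = {
    (1, 1, 1): 0.30, (1, 1, 0): 0.05,
    (1, 0, 1): 0.10, (1, 0, 0): 0.15,
    (0, 1, 1): 0.05, (0, 1, 0): 0.10,
    (0, 0, 1): 0.05, (0, 0, 0): 0.20,
}
outcomes, weights = list(D.keys()), list(D.values())

def sample(m, seed=0):
    """Draw m i.i.d. (x, y) pairs from D."""
    rng = random.Random(seed)
    return [((x1, x2), y) for (x1, x2, y) in
            rng.choices(outcomes, weights=weights, k=m)]

def L_S(h, S):
    """Training error: the fraction of sample points h misclassifies."""
    return sum(h(x) != y for x, y in S) / len(S)

# A small finite hypothesis class H (names are purely illustrative).
H = {
    "tasty iff soft":   lambda x: x[0],
    "tasty iff smelly": lambda x: x[1],
    "always tasty":     lambda x: 1,
    "never tasty":      lambda x: 0,
}

S = sample(m=200)
erm_choice = min(H, key=lambda name: L_S(H[name], S))
print({name: L_S(f, S) for name, f in H.items()})
print("empirical risk minimisation selects:", erm_choice)
```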

Notice that for a single, arbitrary, fixed predictor $h$ from $\mathcal{H}$, $L_{S}(h)$ is a sample mean (an estimator) of the unknown population quantity $L_{\mathcal{D}}(h)$: the former is a random variable, since it depends on the random sample $S$, while the latter is a fixed constant.
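
The following sketch (same invented distribution as above) illustrates this point: resampling $S$ many times produces many different values of $L_S(h)$, all scattered around the single constant $L_{\mathcal{D}}(h)$.

```python
import random, statistics

# Same invented toy distribution as before.
D = {
    (1, 1, 1): 0.30, (1, 1, 0): 0.05,
    (1, 0, 1): 0.10, (1, 0, 0): 0.15,
    (0, 1, 1): 0.05, (0, 1, 0): 0.10,
    (0, 0, 1): 0.05, (0, 0, 0): 0.20,
}
outcomes, weights = list(D.keys()), list(D.values())

def h(x):
    return x[0]  # fixed predictor: tasty iff soft

# The constant: true risk of this fixed h.
L_D = sum(p for (x1, x2, y), p in D.items() if h((x1, x2)) != y)  # 0.30

rng = random.Random(0)
def one_L_S(m):
    """One realisation of the random variable L_S(h)."""
    draws = rng.choices(outcomes, weights=weights, k=m)
    return sum(h((x1, x2)) != y for (x1, x2, y) in draws) / m

errors = [one_L_S(m=100) for _ in range(2000)]
print("L_D(h)          =", L_D)
print("mean of L_S(h)  =", statistics.mean(errors))  # ~0.30: unbiased
print("stdev of L_S(h) =", statistics.stdev(errors)) # shrinks as m grows
```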

In order for our strategy of choosing $h$ from $\mathcal{H}$ by minimising $L_{S}(h)$ as an estimate of $L_{\mathcal{D}}(h)$ to work, we need to be sure that these quantities are 'close together'. The subtlety is that there are multiple $h$s in $\mathcal{H}$, not just one fixed $h$, and we don't know in advance which $h$ will end up minimising $L_S(h)$. Hence we require that an entire collection of random variables $L_S(h)$, one for each $h \in \mathcal{H}$, be 'simultaneously close to' the corresponding collection of constants $L_{\mathcal{D}}(h)$. This would guarantee that any $h$ selected from $\mathcal{H}$ by our strategy minimises a good approximation of the true error $L_{\mathcal{D}}(h)$. Formally, we require a statement of the form

$$\begin{align}\mathbb{P} \left( \forall h \in \mathcal{H}, \; \vert L_{S}(h) - L_{\mathcal{D}}(h) \vert \leq \epsilon \right) &= \mathbb{P} \left( \bigcap_{h \in \mathcal{H}} \left\{ \vert L_{S}(h) - L_{\mathcal{D}}(h) \vert \leq \epsilon \right\} \right) \\ &= \mathbb{P} \left( \sup_{h \in \mathcal{H}} \vert L_{S}(h) - L_{\mathcal{D}}(h) \vert \leq \epsilon \right) \\ & \geq 1 - \delta. \end{align}$$

That is, with probability at least $1 - \delta$ over the draw of an $m$-point training sample, the training error $L_S(h)$ is within $\pm \epsilon$ of the true error $L_{\mathcal{D}}(h)$ for every $h \in \mathcal{H}$. Parsing this in the PAC/statistical learning theory dialect: with confidence $1 - \delta$, the empirical risk functional $L_S(\cdot)$ is uniformly close, i.e. up to an accuracy parameter of $\epsilon$, to the true risk functional $L_{\mathcal{D}}(\cdot)$ over the entire class $\mathcal{H}$. When $\mathcal{H}$ is finite, a statement of this form follows from Hoeffding's inequality combined with a union bound over $\mathcal{H}$. Developing tools for, and identifying conditions under which, uniform bounds/convergence statements like the above hold when $\mathcal{H}$ is infinite is the central preoccupation of both that book and statistical learning theory in general.
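
As a sanity check on the uniform-convergence statement (a sketch only, reusing the invented distribution and hypothesis class from above), one can estimate $\mathbb{P}(\sup_{h \in \mathcal{H}} \vert L_S(h) - L_{\mathcal{D}}(h)\vert > \epsilon)$ by Monte Carlo and compare it with the finite-$\mathcal{H}$ guarantee $2\vert\mathcal{H}\vert e^{-2m\epsilon^2}$ from Hoeffding plus a union bound (which is valid but typically loose):

```python
import math, random

# Same invented toy distribution and finite hypothesis class as before.
D = {
    (1, 1, 1): 0.30, (1, 1, 0): 0.05,
    (1, 0, 1): 0.10, (1, 0, 0): 0.15,
    (0, 1, 1): 0.05, (0, 1, 0): 0.10,
    (0, 0, 1): 0.05, (0, 0, 0): 0.20,
}
outcomes, weights = list(D.keys()), list(D.values())
H = {
    "tasty iff soft":   lambda x: x[0],
    "tasty iff smelly": lambda x: x[1],
    "always tasty":     lambda x: 1,
    "never tasty":      lambda x: 0,
}
# True risk of every h in H (a dictionary of constants).
L_D = {name: sum(p for (x1, x2, y), p in D.items() if f((x1, x2)) != y)
       for name, f in H.items()}

rng = random.Random(0)
def sup_deviation(m):
    """sup over H of |L_S(h) - L_D(h)| for one random sample of size m."""
    draws = rng.choices(outcomes, weights=weights, k=m)
    return max(abs(sum(f((x1, x2)) != y for (x1, x2, y) in draws) / m
                   - L_D[name])
               for name, f in H.items())

eps, m, trials = 0.05, 500, 2000
delta_hat = sum(sup_deviation(m) > eps for _ in range(trials)) / trials
print("Monte Carlo estimate of delta:", delta_hat)
print("Hoeffding + union bound      :", 2 * len(H) * math.exp(-2 * m * eps**2))
```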

So $\epsilon$ is an arbitrary constant in the spirit of real analysis: it retains the semantics of proximity. In this particular context, it quantifies the accuracy of the approximation of $L_{\mathcal{D}}(\cdot)$ by $L_S(\cdot)$, and it is distinct from the training error, which is the (random) quantity $L_S(h)$ itself.

Lastly, to interpret $\mathcal{D}$ in the more artificial context of chapter 2, note that all of the above uses $\mathcal{D} = P(X, Y) = P(Y \mid X)P(X)$, which factorises into a distribution $P(X)$ on papaya observable characteristics and a conditional distribution $P(Y \mid X)$ on papaya tastiness given those characteristics. The setting of chapter 2 is instead $\mathcal{D} = P(X)$ together with $y = f(x)$, meaning that the only stochasticity is in the distribution $P(X)$ of observable characteristics, while the relation $y = f(x)$ between tastiness and observable characteristics is fully deterministic.

To better address the interpretation of $\mathcal{D}$, here is an example in both cases. For illustrative purposes, assume three binary features, so that $X \in \{0, 1\}^3$ and $Y \in \{0, 1\}$.

When $\mathcal{D} = P(X, Y)$, this distribution encodes statements like $$\mathbb{P}(X_1 = \text{soft}, X_2 = \text{not oval-shaped}, X_3 = \text{rotten smelling}, Y = \text{not tasty}) = 0.50.$$

When, instead of $P(X, Y)$, we have $\mathcal{D} = P(X)$ together with $f$, this gives statements like

$$\mathbb{P}(X_1 = \text{soft}, X_2 = \text{not oval-shaped}, X_3 = \text{rotten smelling}) = 0.25$$

together with a 'codebook', in the form of a deterministic function $f$, telling us that $f(\text{soft}, \text{not oval-shaped}, \text{rotten smelling}) = \text{not tasty} = y$ with certainty.
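
Here is a sketch of this chapter-2 setting in code (all probabilities and the particular $f$ are invented for illustration): $\mathcal{D}$ is now just a table of probabilities over $\mathcal{X}$, the codebook $f$ is an ordinary deterministic function, and the error $L_{\mathcal{D},f}(h) = \mathcal{D}(\{x : h(x) \neq f(x)\})$ is simply the mass of the disagreement set.

```python
# Chapter-2 setting: D is a distribution over X alone, plus a deterministic
# labelling function f (the 'codebook'). All values are invented.
P_X = {  # (soft, oval-shaped, rotten-smelling), each binary
    (1, 0, 1): 0.25,
    (1, 1, 0): 0.25,
    (0, 0, 0): 0.25,
    (0, 1, 1): 0.25,
}

def f(x):
    """Deterministic codebook: rotten-smelling papayas are never tasty;
    otherwise a papaya is tasty iff it is soft."""
    soft, oval, rotten = x
    return 0 if rotten else soft

def h(x):
    """A predictor to evaluate: tasty iff soft."""
    return x[0]

# L_{D,f}(h) = D({x : h(x) != f(x)}): mass of the disagreement set.
error = sum(p for x, p in P_X.items() if h(x) != f(x))
print(error)  # only x = (1, 0, 1) disagrees, so the error is 0.25
```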
