Let $I$ be the indicator function: it equals $1$ when its argument is true and $0$ otherwise. Pick $0\lt\alpha\lt 1$ and set
$$\Lambda_\alpha(x)=\alpha x\, I(x\ge 0) - (1-\alpha)x\, I(x\lt 0).$$
This figure plots $\Lambda_{1/5}$. It uses an accurate aspect ratio to help you gauge the slopes, which equal $-4/5$ on the left side and $+1/5$ on the right. In this case excursions above $0$ are heavily downweighted compared to excursions below $0$.
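For concreteness, here is a minimal R sketch of this check function (the name `Lambda` is mine, not from the original); plotting it with `asp = 1` mimics the accurate aspect ratio described above.

```r
# Check function Lambda_alpha: slope -(1 - alpha) left of 0, +alpha right of 0
Lambda <- function(x, alpha) {
  alpha * x * (x >= 0) - (1 - alpha) * x * (x < 0)
}

# For alpha = 1/5 the slopes are -4/5 and +1/5; asp = 1 keeps them gaugeable
curve(Lambda(x, 1/5), from = -2, to = 2, n = 401, asp = 1,
      xlab = "x", ylab = expression(Lambda[1/5](x)))
```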
This is a natural function to try because it weights values $x$ that exceed $0$ differently than $x$ that are less than $0$. Let's compute the associated loss and then optimize it.
Writing $F$ for the distribution function of $X$ and setting $L_\alpha(m,x) = \Lambda_\alpha(x-m)$, compute
$$\eqalign{
\mathbb{E}_F(L_\alpha(m,X))&=\int_\mathbb{R} \Lambda_\alpha(x-m)dF(x)\\
&=\alpha\int_\mathbb{R} I(x\ge m)(x-m) dF(x) - (1-\alpha)\int_\mathbb{R} (x-m)I(x\lt m) dF(x)\\
&=\alpha\int_m^\infty(x-m)dF(x) - (1-\alpha)\int_{-\infty}^m(x-m) dF(x).
}$$
As $m$ varies in this illustration with the Standard Normal distribution $F$, the total probability-weighted area of $\Lambda_{1/5}$ is plotted. (The curve is the graph of $\Lambda_{1/5}(x-m)dF(x)$.) The right-hand plot for $m=0$ most clearly shows the effect of downweighting the positive values, for without this downweighting the plot would be symmetric about the origin. The middle plot shows the optimum, where the total amount of blue ink (representing $\mathbb{E}_F(L_{1/5}(m,X))\ $) is as small as possible.
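As a numerical sketch (reusing the illustrative `Lambda` above and taking $F$ to be the standard Normal via `dnorm`), the expected loss can be evaluated by quadrature and minimized over $m$; the minimizer should agree with `qnorm(1/5)`, about $-0.84$.

```r
# Expected loss E_F[L_alpha(m, X)] for the standard Normal, by numerical integration
expected_loss <- function(m, alpha) {
  integrate(function(x) Lambda(x - m, alpha) * dnorm(x), -Inf, Inf)$value
}

# The minimizer should be the alpha quantile: compare with qnorm(1/5) ~ -0.84
optimize(expected_loss, interval = c(-3, 3), alpha = 1/5)$minimum
```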
This function of $m$ is differentiable, and so its extrema can be found by inspecting its critical points. Applying the Chain Rule and the Fundamental Theorem of Calculus to obtain the derivative with respect to $m$ gives
$$\eqalign{
\frac{\partial}{\partial m}\mathbb{E}_F(L_\alpha(m,X))&=\alpha\left(0-\int_m^\infty dF(x)\right) - (1-\alpha)\left(0 - \int_{-\infty}^m dF(x)\right)\\
&= F(m) - \alpha.
}$$
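This identity is easy to check numerically; the following sketch (reusing `expected_loss` from above) compares a central difference against $F(m)-\alpha$ at an arbitrary point.

```r
# Central-difference check that d/dm E_F[L_alpha(m, X)] = F(m) - alpha
m <- 0.7; alpha <- 1/5; h <- 1e-5
(expected_loss(m + h, alpha) - expected_loss(m - h, alpha)) / (2 * h)
pnorm(m) - alpha  # should agree to several decimal places
```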
For continuous distributions this always has a solution $m$ which, by definition, is any $\alpha$ quantile of $X$. For non-continuous distributions this might not have a solution but there will be at least one $m$ for which $F(x)-\alpha\lt 0$ for all $x\lt m$ and $F(x)-\alpha\ge 0$ for all $x\ge m$: this also (by definition) is an $\alpha$ quantile of $X$.
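As a small illustration of the discrete case (my own toy sample), minimizing the empirical version of the loss over a grid recovers a sample quantile:

```r
# Empirical Lambda-loss over a small discrete sample; its minimizer is a sample quantile
x_sample <- c(1, 2, 2, 5, 7)
emp_loss <- function(m, alpha) mean(Lambda(x_sample - m, alpha))
ms <- seq(0, 8, by = 0.01)
ms[which.min(sapply(ms, emp_loss, alpha = 0.3))]  # compare quantile(x_sample, 0.3, type = 1)
```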
Finally, because $\alpha\ne 0$ and $\alpha\ne 1$, the expected loss grows without bound as $m\to-\infty$ or $m\to\infty$, so neither limit can minimize it. That exhausts the inspection of the critical points, showing that $\Lambda_\alpha$ fits the bill.
As a special case, $\mathbb{E}_F(2L_{1/2}(m,X)) = \mathbb{E}_F\left(\left|m-X\right|\right)$ is the loss exhibited in the question.
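Accordingly (again with the illustrative `expected_loss`), setting $\alpha = 1/2$ should recover the median, which is $0$ for the standard Normal:

```r
# alpha = 1/2 makes the loss proportional to |m - X|, so the minimizer is the median
optimize(expected_loss, interval = c(-3, 3), alpha = 1/2)$minimum  # close to 0
```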
The "probability loss" function has sometimes been called the "linear score" in the literature. Although it looks appealing, this loss function is improper, which means that it does not set the incentive to forecast the true probability that $y_i = 1$. For details, see p. 366 of Gneiting and Raftery ("Strictly Proper Scoring Rules, Prediction, and Estimation", Journal of the American Statistical Association, 2007).
In practice, impropriety means that a silly forecaster (who, for example, skews his probabilities towards the extremes of zero and one) may obtain a better probability loss than a reasonable forecaster.
The following example, based on R code, illustrates this point.
- First, set a random seed (`set.seed(1)`) and fix a sample size (`n <- 10000`).
- Simulate an arbitrary vector of true probabilities: `p_true <- runif(n)`.
- Now, draw a vector of binary observations that follow these probabilities: `y <- runif(n) < p_true`.
- Suppose Anne is an omniscient forecaster who knows the true probabilities `p_true`. Her average probability loss over cases is `loss_Anne <- (sum(p_true[y == FALSE]) + sum((1-p_true)[y == TRUE]))/n`.
- By contrast, consider a second forecaster (Bob) who makes overconfident predictions according to the formula `p_wrong <- 0.5*(p_true + (p_true >= 0.5))`. That is, Bob skews the "small" probabilities (below 50 percent) towards zero and the "large" probabilities (50 percent and above) towards one. Bob's average loss is `loss_Bob <- (sum(p_wrong[y == FALSE]) + sum((1-p_wrong)[y == TRUE]))/n`. (A consolidated, runnable version of this code appears right after the list.)
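Putting the steps together, a consolidated, runnable version of this simulation (same seed, sample size, and formulas as in the list above) looks like this:

```r
set.seed(1)                       # reproducibility
n <- 10000                        # sample size
p_true <- runif(n)                # true probabilities
y <- runif(n) < p_true            # binary outcomes following those probabilities

# Anne forecasts the true probabilities; Bob skews them towards 0 and 1
p_wrong <- 0.5 * (p_true + (p_true >= 0.5))

# Average "probability loss" (linear score) for each forecaster
loss_Anne <- (sum(p_true[y == FALSE]) + sum((1 - p_true)[y == TRUE])) / n
loss_Bob  <- (sum(p_wrong[y == FALSE]) + sum((1 - p_wrong)[y == TRUE])) / n
c(loss_Anne = loss_Anne, loss_Bob = loss_Bob)  # roughly 0.33 vs. 0.29
```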
Running this code on my PC, I find that `loss_Anne` is about $0.33$, whereas `loss_Bob` is about $0.29$. Thus, a perfect forecaster (Anne) loses to an overconfident forecaster (Bob) who deliberately skews his probabilities towards the extremes of zero and one.
Thus, the probability loss should not be used for model comparison, as it will generally not select the true model (even asymptotically). Instead, a strictly proper scoring function like the logarithmic loss or Brier score should be used. Again see the reference mentioned above.
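As a quick sketch of the contrast (reusing `y`, `p_true`, and `p_wrong` from the simulation above), both strictly proper scores should rank Anne's honest forecasts ahead of Bob's skewed ones:

```r
# Brier score: mean squared error between forecast and outcome (strictly proper)
brier_Anne <- mean((y - p_true)^2)
brier_Bob  <- mean((y - p_wrong)^2)

# Logarithmic loss (also strictly proper); p_wrong stays strictly inside (0, 1) here
log_Anne <- -mean(y * log(p_true) + (1 - y) * log(1 - p_true))
log_Bob  <- -mean(y * log(p_wrong) + (1 - y) * log(1 - p_wrong))

c(brier_Anne, brier_Bob)  # Anne's score is lower (better)
c(log_Anne, log_Bob)      # likewise for the log loss
```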
The state-of-the-art reference on the matter is [1]. Essentially, it shows that classifiers trained with any of the loss functions you specify converge to the Bayes classifier, and with fast rates.
Choosing between these losses for finite samples can be driven by several different arguments.
[1] Bartlett, Peter L, Michael I Jordan, and Jon D McAuliffe. “Convexity, Classification, and Risk Bounds.” Journal of the American Statistical Association 101, no. 473 (March 2006): 138–56. doi:10.1198/016214505000000907.