[Math] Proving the MAE (mean absolute error) argmin is the median using functionals

machine learning, statistics

For example, using functionals I was able to find that the argmin of the mean squared error is $\mathbb{E}[Y \mid X = x]$:

$f^* = \operatorname{argmin}_f \, \mathbb{E}_{x,y \sim p^*(x,y)}\,(y-f(x))^2$

gives us, after differentiating with respect to $f$:

$\int_y (-2y + 2f(x))\, p(y \mid x)\, dy = 0$, and thus:

$2\,\mathbb{E}[Y \mid X = x] - 2f(x) = 0$, which leads to $f^*(x) = \mathbb{E}[Y \mid X = x]$.
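As a quick numerical sanity check (a minimal NumPy sketch for the simpler case of a constant predictor; the skewed exponential sample is an arbitrary choice of mine), minimizing the empirical squared-error risk over a grid of constants recovers the sample mean:

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.exponential(scale=2.0, size=5_000)  # skewed sample, so mean != median

# Empirical squared-error risk R(c) = mean((y - c)^2) over a grid of constants c.
grid = np.linspace(y.min(), y.max(), 1_001)
risk = ((y[None, :] - grid[:, None]) ** 2).mean(axis=1)

print(grid[risk.argmin()], y.mean())  # the two values nearly coincide
```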

However, I can't use the same logic to find the functional for:

$f^* = \operatorname{argmin}_f \, \mathbb{E}_{x,y \sim p^*(x,y)}\,|y-f(x)|$

I know the absolute value is not differentiable at its minimum, but is there any way to use functionals to motivate the median as the minimizer?

Thanks.

Best Answer

As far as I know you can indeed use derivatives to justify the median as the minimizer of the absolute value loss. Considering $X\sim F$, where $F$ has density $f$ for simplicity, we are interested in \begin{align*} \min_{c \in \mathbb{R}} \mathbb{E}[|X-c|] = \min_{c \in \mathbb{R}} \left( \int_{-\infty}^{c} (c-x) f(x)\,dx + \int_{c}^{\infty} (x-c) f(x)\,dx \right). \end{align*}

Now we can try to get to our first order condition. When the variable of interest (here $c$) appears in the integral boundaries, this can be done via the Leibniz rule: https://en.wikipedia.org/wiki/Leibniz_integral_rule. Note that the boundary terms vanish, since the integrands $(c-x)f(x)$ and $(x-c)f(x)$ are both zero at $x = c$. Taking derivatives with respect to $c$ then leads to the first order condition \begin{align} 0 = \int_{-\infty}^{c} f(x)\,dx - \int_{c}^{\infty} f(x)\,dx = F(c) - (1-F(c)) = 2F(c) - 1, \end{align} where I assume that we can interchange limits and integrals. The last equation is equivalent to $c = F^{-1}(0.5)$, showing that the median is a stationary point. To show that the median is actually the minimum you can consider the function $g(c) = \mathbb{E}[|X-c|]$ and show that it is convex, which follows from the convexity of $|x|$.
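As a quick numerical illustration of this result (a minimal NumPy sketch; the exponential sample is just an arbitrary skewed distribution), minimizing the empirical absolute-error risk over a grid of constants lands on the sample median rather than the mean:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.exponential(scale=2.0, size=5_000)  # skewed, so mean != median

# Empirical absolute-error risk g(c) = mean(|x - c|) over a grid of constants c.
grid = np.linspace(x.min(), x.max(), 1_001)
risk = np.abs(x[None, :] - grid[:, None]).mean(axis=1)

print(grid[risk.argmin()])     # close to np.median(x), not to x.mean()
print(np.median(x), x.mean())
```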

Since you added the machine learning tag: this type of reasoning can be used in several statistical contexts. It is the starting point for Quantile Regression, where the loss function is generalized to target arbitrary quantiles $F^{-1}(\alpha)$ for $\alpha \in (0,1)$. Statistical functionals (often defined via minimization of an expected loss) are a standard starting point for statistics in general, but are specifically studied in Robust Statistics.
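To make the quantile regression remark concrete: the generalized loss is the check (pinball) loss $\rho_\alpha(u) = u(\alpha - \mathbf{1}\{u < 0\})$, whose expected-loss minimizer is $F^{-1}(\alpha)$. A minimal numerical sketch (the normal sample and grid are my illustrative choices):

```python
import numpy as np

def pinball(u, alpha):
    """Check (pinball) loss: alpha * u for u >= 0, (alpha - 1) * u for u < 0."""
    return np.where(u >= 0, alpha * u, (alpha - 1) * u)

rng = np.random.default_rng(2)
x = rng.normal(size=5_000)

alpha = 0.9
grid = np.linspace(-4.0, 4.0, 1_001)
risk = pinball(x[None, :] - grid[:, None], alpha).mean(axis=1)

print(grid[risk.argmin()])     # close to the 0.9-quantile of N(0, 1), about 1.28
print(np.quantile(x, alpha))
```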


Edit: convexity by itself is not enough to guarantee the existence of a minimizer; as is usual in real analysis, coercivity (or a comparable condition) needs to be invoked. For the case discussed above, Theorem 6.8 in E.L. Lehmann, "Theory of Point Estimation", John Wiley & Sons, 1983, is sufficient.
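For completeness, coercivity can also be seen directly here (my addition, assuming $\mathbb{E}[|X|] < \infty$): by the reverse triangle inequality $|x - c| \ge |c| - |x|$, so \begin{align*} g(c) = \mathbb{E}[|X-c|] \ge |c| - \mathbb{E}[|X|] \to \infty \quad \text{as } |c| \to \infty, \end{align*} and a convex (hence continuous) coercive function on $\mathbb{R}$ attains its minimum.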
