Solved – the population minimizer for Huber loss

Tags: extreme-value, loss-functions, population, prediction

Suppose we make predictions for a continuous outcome $Y$ conditional on a vector of covariates $X$. If we use mean squared error (MSE) as our loss function, the population minimizer is $\mathbf{E}[Y \;|\; X]$. If instead we use mean absolute error (MAE), the population minimizer is the conditional median.
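As a quick numerical sanity check, here is a small R sketch on a skewed sample; the empirical risk minimizers land on the sample mean and median respectively:

set.seed(1)
y <- rexp(1000)  # a skewed sample standing in for the conditional distribution of Y

optimize(function(m) mean((y - m)^2), interval = range(y))$minimum  # MSE minimizer
mean(y)                                                             # ~ the same

optimize(function(m) mean(abs(y - m)), interval = range(y))$minimum # MAE minimizer
median(y)                                                           # ~ the same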

What is the population minimizer for Huber loss? For example, if I fit a gradient boosting machine (GBM) with Huber loss, what optimal prediction am I attempting to learn?


See The Elements of Statistical Learning (Second Edition), Section 2.4 (Statistical Decision Theory) for the population minimizers under MSE and MAE, and Section 10.6 (Loss Functions and Robustness) for a definition of Huber loss:

$\begin{equation*}L(y, f(x)) = \begin{cases}(y - f(x))^2 & \text{if } \lvert y - f(x) \rvert \leq \delta \\ 2\delta\,\lvert y - f(x) \rvert - \delta^2 & \text{otherwise}\end{cases}\end{equation*}$
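Transcribed directly into R (a plain function, with `delta` standing in for $\delta$):

# Huber loss as defined above: quadratic inside [-delta, delta], linear outside
huber_loss <- function(y, f, delta) {
  r <- y - f
  ifelse(abs(r) <= delta, r^2, 2 * delta * abs(r) - delta^2)
}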

Best Answer

The sample Huber estimator doesn't have a closed form -- the estimates are obtained iteratively. However, it is somewhat analogous to a trimmed mean, at least in a particular sense.
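To make "obtained iteratively" concrete, here's a minimal iteratively-reweighted-mean sketch of the Huber location estimate, with the scale held fixed at the MAD (MASS's huber(), used further below, solves essentially the same problem):

library(MASS)

huber_loc <- function(x, k = 1.5, s = mad(x), tol = 1e-8) {
  mu <- median(x)                        # robust starting value
  repeat {
    w <- pmin(1, k * s / abs(x - mu))    # Huber weights: 1 inside, downweighted outside
    mu_new <- weighted.mean(x, w)
    if (abs(mu_new - mu) < tol) return(mu_new)
    mu <- mu_new
  }
}

set.seed(1)
z <- c(rnorm(50), 8)                     # one gross outlier
huber_loc(z)
huber(z)$mu                              # MASS's version agrees closely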

If it were applied to the whole population, it would essentially come down to a mean of the values between two quantiles ($x_{\alpha_1}=F_X^{-1}(\alpha_1)$ and $x_{1-\alpha_2}$), but which quantiles those are depends on the particulars of the distribution, the specifics of the Huber spread estimation, and the value of $k$. (In the general case those quantiles won't be symmetric -- i.e. typically $\alpha_1\neq \alpha_2$.)
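For a rough sketch of the population problem -- with the scale complication stripped out by fixing $\delta = 1$ -- we can minimize the expected Huber loss $\mathbf{E}[L(Y, c)]$ over constants $c$ numerically, here for $Y \sim \text{Exp}(1)$:

# Expected Huber loss E[L(Y, c)] for Y ~ Exp(1), with delta fixed at 1
huber_risk <- function(m, delta = 1) {
  integrate(function(y) {
    r <- y - m
    ifelse(abs(r) <= delta, r^2, 2 * delta * abs(r) - delta^2) * dexp(y)
  }, lower = 0, upper = Inf)$value
}
optimize(huber_risk, interval = c(0, 5))$minimum
# falls strictly between the median (log 2 ~ 0.69) and the mean (1),
# as we'd expect for a loss intermediate between MAE and MSE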

Specifically, consider the influence function. Here we'll look at something closely related to (and similar in appearance to) the empirical influence function, which in this example is simply the estimate itself, computed on a sample taken to be a set of expected normal order statistics (or rather, approximations to them, though that hardly matters), plus one additional observation that is allowed to vary across the real line. I'll call this the influence function, though strictly speaking it would need to be adjusted (in ways that don't change its general appearance). The population influence functions are similar in general appearance.

One advantage of using the empirical function on a sample constructed to look about as much like the population as we can expect (expected quantiles) is that we can convey the flavor of what's going on with something that behaves very much like the population equivalent, while avoiding the need to introduce Gateaux derivatives. For details on those, see the references.

With these we see how the two estimates respond to a single changing observation:

[Figure: empirical influence functions for the trimmed mean (left) and the Huber estimator (right), for particular choices of trimming fraction and Huber parameter $k$; both curves are continuous and linear in the middle, but flat beyond a lower and an upper threshold.]

These are quantitatively very similar, indicating that the two respond very similarly to a small proportion of outliers (a fraction below the trimming proportion $\alpha$, in the case of the trimmed mean) for a fairly symmetric sample. However, there are differences in how they respond as the sample becomes more extreme. With more extreme outliers, the Huber estimator will effectively "trim" a greater percentage: for a very skewed sample it acts more like an asymmetric trim, in effect "trimming" little or nothing from the light-tailed side and heavily from the heavy-tailed side; and if both tails become very heavy, it acts as if it were trimming more overall.
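One quick way to see the asymmetric behaviour on a skewed sample (using huber() with its default MAD scale) is to check which observations end up beyond the clipping threshold of the fitted estimate -- they sit almost entirely in the upper tail:

library(MASS)

set.seed(42)
y  <- rexp(500)                        # strongly right-skewed sample
mu <- huber(y, k = 1.5)$mu
s  <- mad(y)
clipped <- abs(y - mu) > 1.5 * s       # points beyond the linear region of the loss
table(sign(y - mu)[clipped])           # effectively "trimmed"; all on the upper side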

Here's some R code, because someone will want it:

library(MASS)   # for huber()

opar <- par(mfrow = c(1, 2), cex.main = 1)  # par() returns the previous settings

xn <- seq(-4, 4, length.out = 201)     # positions for the moving observation
x  <- qnorm(ppoints(19, a = 3/8))      # approximate expected normal order statistics

f  <- function(xc, x) mean(c(x, xc), trim = 0.05)    # 5% trimmed mean
f2 <- function(xc, x) huber(c(x, xc), k = 1.95)$mu   # Huber location estimate

infl <- sapply(xn, f, x = x)
plot(infl ~ xn, type = "l", xlim = c(-4, 4), col = "blue3",
     main = "Trimmed mean empirical IF")
inflh <- sapply(xn, f2, x = x)
plot(inflh ~ xn, type = "l", xlim = c(-4, 4), col = "blue3",
     main = "Huber empirical IF")

par(opar)

As for references, the classic ones are the books by Huber [1] and Hampel et al. [2]. There's a little on M-estimation in the first four pages here. The Wikipedia page is a bit sparse but may help.

A caveat: a number of references claim that the influence functions for trimmed means redescend. As we see by actually computing them, this is not so (and it's easy to see why: trimmed means don't completely ignore the observations they trim, since those observations still count toward how many lie on each side, and that continues to pull the resulting estimate no matter how far away the observation gets).
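A quick check: move the extra observation from 10 to 100 and the trimmed mean doesn't budge -- flat, not redescending -- while it remains shifted relative to the estimate without the outlier:

x <- qnorm(ppoints(19, a = 3/8))       # same pseudo-sample as above
mean(c(x, 10),  trim = 0.05)           # outlier at 10 ...
mean(c(x, 100), trim = 0.05)           # ... and at 100: identical estimates
mean(x, trim = 0.05)                   # without the outlier: noticeably lower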

[1] Huber, Peter J. (1981), Robust Statistics, New York: John Wiley & Sons. ISBN 0-471-41805-6, MR 606374. (Republished in paperback, 2004; 2nd ed., Wiley.)

[2] Hampel, Frank R.; Ronchetti, Elvezio M.; Rousseeuw, Peter J.; Stahel, Werner A. (1986), Robust Statistics: The Approach Based on Influence Functions, New York: John Wiley & Sons.