Solved – What are the pros and cons of Huber vs. Pseudo-Huber loss functions

artificial-intelligence, loss-functions, machine-learning, neural-networks

The Huber loss is: $$ \mathrm{huber}(t) =
\begin{cases}
\frac{1}{2} t^2 & \quad\text{if}\quad |t|\le \beta \\
\beta\left(|t| - \frac{1}{2}\beta\right) &\quad\text{otherwise,}
\end{cases} $$

where the offset $-\tfrac{1}{2}\beta$ in the linear branch makes the loss and its first derivative continuous at $|t| = \beta$.
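
For concreteness, here is a minimal NumPy sketch of the piecewise definition above; the function name and the default $\beta = 1$ are illustrative choices, not from any particular library.

```python
import numpy as np

def huber(t, beta=1.0):
    """Huber loss: quadratic for |t| <= beta, linear beyond.

    The offset -beta/2 in the linear branch makes the loss and its
    first derivative continuous at |t| = beta.
    """
    t = np.asarray(t, dtype=float)
    quadratic = 0.5 * t**2
    linear = beta * (np.abs(t) - 0.5 * beta)
    return np.where(np.abs(t) <= beta, quadratic, linear)
```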

The Pseudo-Huber loss is:
$$ \mathrm{pseudo}(t) = \delta^2\left(\sqrt{1+\left(\frac{t}{\delta}\right)^2}-1\right)$$
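
A matching sketch of the Pseudo-Huber loss, under the same illustrative conventions; the quick check at the end just confirms the two asymptotic regimes.

```python
import numpy as np

def pseudo_huber(t, delta=1.0):
    """Pseudo-Huber loss: delta^2 * (sqrt(1 + (t/delta)^2) - 1).

    Smooth everywhere; behaves like t^2/2 for |t| << delta and like
    delta * |t| for |t| >> delta.
    """
    t = np.asarray(t, dtype=float)
    return delta**2 * (np.sqrt(1.0 + (t / delta) ** 2) - 1.0)

# Asymptotic check: near zero it tracks t^2/2; in the tail it grows
# roughly linearly with slope approaching delta.
print(pseudo_huber(0.01), 0.5 * 0.01**2)  # ~5e-05 vs 5e-05
print(pseudo_huber(100.0) / 100.0)        # ~0.99, slope approaching delta
```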

What are the pros and cons of using Pseudo-Huber over Huber? I don't see much research using Pseudo-Huber, and I wonder why.

As I see it, the Pseudo-Huber loss lets you control the smoothness through $\delta$, so you can decide precisely how heavily to penalise outliers, whereas the Huber loss simply behaves like MSE below the threshold and like MAE above it. Also, the Huber loss does not have a continuous second derivative, as spelled out below.
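
To make the second-derivative point concrete, differentiating each loss twice (a routine computation from the two definitions above) gives

$$ \mathrm{huber}''(t) =
\begin{cases}
1 & \quad\text{if}\quad |t| < \beta \\
0 & \quad\text{if}\quad |t| > \beta
\end{cases}
\qquad
\mathrm{pseudo}''(t) = \frac{1}{\left(1+\left(\frac{t}{\delta}\right)^2\right)^{3/2}} $$

so the Huber curvature jumps from $1$ to $0$ at $|t| = \beta$, while the Pseudo-Huber curvature is continuous (indeed infinitely differentiable) everywhere.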

So, what exactly are the cons of Pseudo-Huber, if any?

Best Answer

Advantages of the Huber loss:

  1. You don't have to choose a $\delta$. (Of course you may like the freedom to "control" that comes with such a choice, but some would rather avoid making a choice without clear information and guidance on how to make it.)

  2. The M-estimator with Huber loss function has been shown to have a number of optimality properties. It is the estimator of the mean with minimax asymptotic variance in a symmetric contamination neighbourhood of the normal distribution (as shown by Huber in his famous 1964 paper), and it is the estimator of the mean with minimum asymptotic variance subject to a given bound on the influence function, assuming a normal distribution; see Frank R. Hampel, Elvezio M. Ronchetti, Peter J. Rousseeuw and Werner A. Stahel, Robust Statistics: The Approach Based on Influence Functions. Hampel has written somewhere that Huber's M-estimator (based on Huber's loss) is optimal in four respects, but I've forgotten the other two. Note that these properties also hold for distributions other than the normal: a general Huber-type estimator uses a loss function based on the likelihood of the distribution of interest, of which the loss you wrote down is the special case for the normal distribution.

  3. I'm not saying that the Huber loss is generally better; one may want smoothness and the ability to tune it. However, this means deviating from optimality in the sense above.