[Math] Differentiating the L1-norm

calculus, optimization

I have been reading somewhere that gradient descent cannot be used for a non-differentiable function. And apparently the L1-norm is not differentiable everywhere; specifically, it is not differentiable at zero. As a beginner to optimization, the reason for both of these is confusing to me…

For example, let's say I have a data pair $(x, y)$ and a function $f(x)$ which depends on a parameter $m$. I want to learn the parameter which minimizes a cost function.

Now let's say that this cost function $C$ is the sum of the absolute differences between each $y$ and each $f(x)$. So, $C = \sum{|f(x) - y|}$ over all data pairs. This is the same as the L1-norm.

Suppose that this is linear regression, so $f(x) = mx$. Differentiating the cost function, we get $\frac{dC}{dm}=\sum{x}$, because $\frac{df(x)}{dm} = x$.

Based on all of this being correct… why is this cost function not differentiable at all points, given that apparently the derivative is simply $\sum{x}$ at all points?
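For comparison, here is the same chain-rule step with the absolute value kept (a sketch under the setup above, writing $\operatorname{sign}$ for the derivative of $|\cdot|$ away from zero):

$$\frac{dC}{dm} = \frac{d}{dm}\sum{|mx - y|} = \sum{\operatorname{sign}(mx - y)\,x},$$

which is undefined for any data pair with $mx = y$, i.e. exactly where a residual is zero.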

Best Answer

L1 optimization is a huge field, with both direct methods (simplex, interior point) and iterative methods. I have used iteratively reweighted least squares (IRLS) with conjugate gradient and approximate inner iterations. In practice, the fact that the absolute value is not differentiable at the origin is not really a problem. For example, minimizing $\sum{|x - x_i|}$ over $x$ just says that $x$ is the median of the $x_i$, since the derivatives are all $\pm 1$ (assuming $x$ is not one of the $x_i$).

Here is a reference: J. A. Scales and A. Gersztenkorn, "Robust methods in inverse theory," Inverse Problems 4(4), 1071. There is a PDF on Google Scholar. The algorithm described there can be used for $L_p$-norm optimization for any $p \ge 1$. These problems are all nonlinear except for $p = 2$, but we found that using the approximate inner iterations made the L1 problem almost as fast as the least squares problem.

Added: an interesting historical note is that L1 optimization was invented by Boscovich before Gauss invented least squares.
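Here is a minimal sketch of the IRLS idea for the one-parameter L1 regression in the question, assuming NumPy. The weights $1/\max(|r_i|,\epsilon)$, the least-squares starting point, and the stopping rule are illustrative choices, not the exact scheme from the paper above:

```python
import numpy as np

def irls_l1_slope(x, y, n_iter=50, eps=1e-8):
    """Fit m minimizing sum |m*x_i - y_i| by iteratively reweighted
    least squares: each pass solves a weighted least-squares problem
    with weights 1/|residual|, so that w_i * r_i^2 ~ |r_i| and the
    iteration targets the L1 objective."""
    m = np.sum(x * y) / np.sum(x * x)                  # least-squares start
    for _ in range(n_iter):
        r = m * x - y                                  # current residuals
        w = 1.0 / np.maximum(np.abs(r), eps)           # eps guards the kink at r = 0
        m_new = np.sum(w * x * y) / np.sum(w * x * x)  # weighted LS slope
        if abs(m_new - m) < 1e-12:
            break
        m = m_new
    return m

# Toy check: a gross outlier pulls least squares but barely moves the L1 fit.
rng = np.random.default_rng(0)
x = rng.uniform(1, 10, size=50)
y = 3.0 * x + rng.normal(0, 0.1, size=50)
y[0] += 100.0                                          # gross outlier
print(irls_l1_slope(x, y))                             # close to 3.0
```

The same routine illustrates the median remark: with `x` set to all ones, it minimizes $\sum{|m - y_i|}$ and approximately reproduces `np.median(y)` (only approximately, because of the `eps` smoothing at the kinks).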