Solved – What do you think about switching from L2 loss to L1 loss when the difference becomes less than 1?

gradient-descent, loss-functions, machine-learning

Squaring a number less than 1.0 makes it smaller, so the gradient becomes smaller and learning very slow. It therefore seems natural to me to use the L2 loss when the difference between target and output is greater than one, and the L1 loss when it is less than 1.0. The same holds, for example, for Mean Relative Squared Error, where we don't need to take the square when the relative error is less than one.
I have tested this idea on a real time-series prediction dataset with different gradient descent optimizers (Adam, vanilla SGD, Adadelta), and it gave me much faster and better convergence. What do you think: is this dataset-specific, or should it be general practice? The error surface becomes non-smooth and non-convex when the two losses are combined, but the same is true for ReLU and that is not a problem there. What are the potential dangers of this approach?
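To make the proposal concrete, here is a minimal sketch of the hybrid loss described in the question; the threshold of 1.0 matches the question, but the function names and the use of NumPy are illustrative assumptions, not the asker's actual code:

```python
import numpy as np

def hybrid_loss(error):
    """L2 loss where |error| > 1, L1 loss where |error| <= 1 (elementwise)."""
    abs_err = np.abs(error)
    return np.where(abs_err > 1.0, error ** 2, abs_err)

def hybrid_grad(error):
    """Gradient of hybrid_loss with respect to the error."""
    # For small errors the gradient magnitude stays at 1 instead of
    # shrinking toward 0 as the L2 gradient (2 * error) would.
    return np.where(np.abs(error) > 1.0, 2.0 * error, np.sign(error))

print(hybrid_grad(np.array([0.1, 0.5, 2.0])))  # -> [1. 1. 4.]
```

Note that the loss itself is continuous at |error| = 1 (both pieces equal 1 there), but the gradient jumps from 1 to 2 at the boundary, which is the non-smoothness mentioned above.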

Best Answer

Changing the loss changes the problem, so you cannot objectively compare training under one loss with training under another.

As for your idea of using a hybrid between the L1 and L2 losses: the Huber loss does this, but the other way around, using the L2 loss when the difference is small and the L1 loss otherwise.
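For reference, one standard definition of the Huber loss with threshold $\delta$ (with $\delta = 1$ corresponding to the cutoff in the question) is:

$$
L_\delta(a) =
\begin{cases}
\tfrac{1}{2} a^2 & \text{if } |a| \le \delta, \\
\delta \left( |a| - \tfrac{1}{2}\delta \right) & \text{otherwise.}
\end{cases}
$$

The $-\tfrac{1}{2}\delta$ offset makes both the loss and its gradient continuous at $|a| = \delta$: the gradient is $a$ inside the threshold and $\delta \cdot \operatorname{sign}(a)$ outside, so the two pieces agree at the boundary. The reversed hybrid in the question does not have this property, since its gradient jumps at the switch point.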