Solved – What do you think about switching from L2 loss to L1 loss when the difference becomes less than 1?

gradient-descent, loss-functions, machine-learning

Squaring a number less than 1.0 makes it smaller, so the gradient becomes smaller and learning very slow. It therefore seems natural to me to use the L2 loss when the difference between target and output is greater than one, and the L1 loss when it is less than 1.0. The same holds, for example, for Mean Relative Squared Error, where we don't need to take the square when the relative error is less than one.
I have tested this idea on a real time-series prediction dataset with different gradient descent optimizers (Adam, vanilla SGD, Adadelta), and it gave me much faster and better convergence. What do you think: is this dataset-specific, or should it be general practice? The error surface becomes non-smooth and non-convex when the two losses are combined, but the same is true for ReLU and that is not a problem there. What are the potential dangers of this approach?
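To make the proposal concrete, here is a minimal sketch of the hybrid loss described in the question; the threshold of 1.0 matches the question, but the function names and the use of NumPy are illustrative assumptions, not the asker's actual code:

```python
import numpy as np

def hybrid_loss(error):
    """L2 loss where |error| > 1, L1 loss where |error| <= 1 (elementwise)."""
    abs_err = np.abs(error)
    return np.where(abs_err > 1.0, error ** 2, abs_err)

def hybrid_grad(error):
    """Gradient of hybrid_loss with respect to the error."""
    # For small errors the gradient magnitude stays at 1 instead of
    # shrinking toward 0 as the L2 gradient (2 * error) would.
    return np.where(np.abs(error) > 1.0, 2.0 * error, np.sign(error))

print(hybrid_grad(np.array([0.1, 0.5, 2.0])))  # -> [1. 1. 4.]
```

Note that the loss itself is continuous at |error| = 1 (both pieces equal 1 there), but the gradient jumps from 1 to 2 at the boundary, which is the non-smoothness mentioned above.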

Best Answer

Changing the loss changes the problem, so you cannot objectively compare training under one loss with training under another.

As for your idea of using a hybrid between the L1 and L2 losses: the Huber loss does this, but the other way around, using the L2 loss when the difference is small and the L1 loss otherwise.
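For reference, one standard definition of the Huber loss with threshold $\delta$ (with $\delta = 1$ corresponding to the cutoff in the question) is:

$$
L_\delta(a) =
\begin{cases}
\tfrac{1}{2} a^2 & \text{if } |a| \le \delta, \\
\delta \left( |a| - \tfrac{1}{2}\delta \right) & \text{otherwise.}
\end{cases}
$$

The $-\tfrac{1}{2}\delta$ offset makes both the loss and its gradient continuous at $|a| = \delta$: the gradient is $a$ inside the threshold and $\delta \cdot \operatorname{sign}(a)$ outside, so the two pieces agree at the boundary. The reversed hybrid in the question does not have this property, since its gradient jumps at the switch point.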