TL;DR
I recommend using LIPO. It is provably correct and provably better than pure random search (PRS). It is also extremely simple to implement and has no hyperparameters. I have not conducted an analysis that compares LIPO to BO, but my expectation is that the simplicity and efficiency of LIPO imply that it will outperform BO.
(See also: What are some of the disavantage of bayesian hyper parameter optimization?)
Bayesian Optimization
Bayesian Optimization-type methods build Gaussian process surrogate models to explore the parameter space. The main idea is that parameter tuples that are closer together will have similar function values, so the assumption of a covariance structure among points allows the algorithm to make educated guesses about which parameter tuple is most worthwhile to try next. This strategy helps to reduce the number of function evaluations; in fact, the motivation of BO methods is to keep the number of function evaluations as low as possible while "using the whole buffalo" to make good guesses about what point to test next. There are different figures of merit (expected improvement, expected quantile improvement, probability of improvement...) which are used to compare points to visit next.
Contrast this to something like a grid search, which will never use any information from its previous function evaluations to inform where to go next.
Incidentally, this is also a powerful global optimization technique, and as such makes no assumptions about the convexity of the surface. Additionally, if the function is stochastic (say, evaluations have some inherent random noise), this can be directly accounted for in the GP model.
On the other hand, you'll have to fit at least one GP at every iteration (or several, picking the "best", or averaging over alternatives, or using fully Bayesian methods). Then the model is used to make (probably thousands of) predictions, usually in the form of multistart local optimization, exploiting the observation that it's much cheaper to evaluate the GP prediction function than the function under optimization. But even with this computational overhead, it tends to be the case that even nonconvex functions can be optimized with a relatively small number of function calls.
A widely-cited paper on the topic is Jones et al (1998), "Efficient Global Optimization of Expensive Black-Box Functions." But there are many variations on this idea.
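To make the fit-predict-evaluate loop concrete, here is a minimal numpy-only sketch of GP-based Bayesian optimization with the expected improvement criterion. The squared-exponential kernel, the fixed length-scale, the toy one-dimensional objective, and the use of a candidate grid in place of multistart local optimization are all simplifying assumptions, not part of any particular library's API.

```python
import math
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    # toy stand-in for an expensive black-box objective (to be maximized)
    return np.sin(3 * x) + 0.5 * x

def gp_posterior(X, y, Xs, length=0.3, noise=1e-6):
    """GP regression with a squared-exponential kernel and zero prior mean."""
    def kern(a, b):
        return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / length ** 2)
    K_inv = np.linalg.inv(kern(X, X) + noise * np.eye(len(X)))
    Ks = kern(X, Xs)
    mu = Ks.T @ K_inv @ y
    var = 1.0 - np.sum(Ks * (K_inv @ Ks), axis=0)  # diag of the posterior covariance
    return mu, np.sqrt(np.maximum(var, 1e-12))

def expected_improvement(mu, sd, best):
    z = (mu - best) / sd
    Phi = 0.5 * (1 + np.vectorize(math.erf)(z / np.sqrt(2)))  # standard normal cdf
    phi = np.exp(-z ** 2 / 2) / np.sqrt(2 * np.pi)            # standard normal pdf
    return (mu - best) * Phi + sd * phi

grid = np.linspace(0, 2, 200)   # cheap candidate set instead of multistart optimization
X = rng.uniform(0, 2, 3)        # a few random evaluations to seed the surrogate
y = f(X)
for _ in range(10):
    mu, sd = gp_posterior(X, y, grid)
    x_next = grid[np.argmax(expected_improvement(mu, sd, y.max()))]
    X, y = np.append(X, x_next), np.append(y, f(x_next))

print(round(float(X[np.argmax(y)]), 2))  # best point found so far
```

Each iteration refits the GP to all evaluations so far and spends its one "expensive" function call on the grid point with the highest expected improvement, which is the trade-off described above.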
Random Search
Even when the cost function is expensive to evaluate, random search can still be useful. Random search is dirt-simple to implement. The only choice for a researcher to make is setting the probability $p$ that you want your results to lie in some quantile $q$; the rest proceeds automatically using results from basic probability.
Suppose your quantile is $q = 0.95$ and you want a $p=0.95$ probability that the model results are in top $100\times (1-q)=5$ percent of all hyperparameter tuples. The probability that all $n$ attempted tuples are not in that window is $q^n = 0.95^n$ (because they are chosen independently at random from the same distribution), so the probability that at least one tuple is in that region is $1 - 0.95^n$. Putting it all together, we have
$$
1 - q^n \ge p \implies n \ge \frac{\log(1 - p)}{\log(q)}
$$
which in our specific case yields $n \ge 59$.
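This formula is simple enough to fold into a one-line helper; the function name below is my own, not from any library.

```python
import math

def random_search_budget(p: float, q: float) -> int:
    """Smallest n with 1 - q**n >= p: the number of uniformly random tuples
    needed so that, with probability at least p, at least one of them lands
    in the top (1 - q) fraction of all hyperparameter tuples."""
    return math.ceil(math.log(1 - p) / math.log(q))

print(random_search_budget(0.95, 0.95))  # 59
```

Note how slowly the budget grows as you tighten the quantile: demanding the top 1 percent ($q = 0.99$) at the same confidence still needs only $n = 299$ evaluations.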
This result is why most people recommend $n=60$ attempted tuples for random search. It's worth noting that $n=60$ is comparable to the number of experiments required to get good results with Gaussian Process-based methods when there are a moderate number of parameters. Unlike with Gaussian Process-based methods, the number of queried tuples does not change with the number of hyperparameters to search over; indeed, for a large number of hyperparameters, a Gaussian process-based method can take many iterations to make headway.
Since you have a probabilistic characterization of how good the results are, this result can be a persuasive tool to convince your boss that running additional experiments will yield diminishing marginal returns.
LIPO and its Variants
This is an exciting arrival which, if it is not new, is certainly new to me. It proceeds by placing informed bounds on the function, sampling only where those bounds allow an improvement over the best value seen so far, and, in some variants, refining with local quadratic approximations. I'm still working through all the details, but I think this is very promising. This is a nice blog write-up, and the paper is Cédric Malherbe and Nicolas Vayatis, "Global optimization of Lipschitz functions."
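As I understand the core idea, the Lipschitz assumption gives a cheap upper bound at any candidate point, and a random candidate is evaluated only if that bound says it could still beat the incumbent. Here is a minimal sketch under the simplifying assumption that the Lipschitz constant $k$ is known (the AdaLIPO variant estimates it adaptively); the objective is a toy of my own choosing.

```python
import numpy as np

rng = np.random.default_rng(1)

def f(x):
    return -(x - 0.7) ** 2   # toy objective with its maximum at x = 0.7

k = 2.0                      # assumed Lipschitz constant; AdaLIPO would estimate this
X = [rng.uniform(0, 1)]
y = [f(X[0])]
for _ in range(200):
    x = rng.uniform(0, 1)    # draw a uniformly random candidate
    # upper bound on f(x) implied by the Lipschitz assumption:
    upper = min(yi + k * abs(x - xi) for xi, yi in zip(X, y))
    if upper >= max(y):      # x could still beat the incumbent, so pay for an evaluation
        X.append(x)
        y.append(f(x))

print(round(X[int(np.argmax(y))], 2))  # should land near 0.7
```

The appeal is visible even in this sketch: the only quantity carried between iterations is the evaluation history, there is no surrogate model to refit, and the accept/reject test costs one pass over the history.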
Best Answer
Changing the loss changes the problem, so you can't objectively compare using one loss with another.
As for your idea of using a hybrid of the L1 and L2 losses: the Huber loss does this, but the other way around, behaving like the L2 loss when the residual is small and like the L1 loss otherwise.
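For reference, the standard Huber loss with threshold $\delta$ can be written in a few lines; the quadratic and linear pieces are stitched together so the loss and its derivative are continuous at $|r| = \delta$.

```python
import numpy as np

def huber(r, delta=1.0):
    """Huber loss: quadratic (L2-like) for |r| <= delta, linear (L1-like) beyond.
    The linear piece delta*(|r| - delta/2) matches the quadratic piece at |r| = delta."""
    r = np.abs(r)
    return np.where(r <= delta, 0.5 * r ** 2, delta * (r - 0.5 * delta))

print(huber(np.array([0.5, 1.0, 3.0])))  # quadratic, boundary, then linear regime
```

With `delta=1.0`, a residual of 0.5 is penalized quadratically (0.125) while a residual of 3.0 is penalized only linearly (2.5), which is what makes the loss robust to outliers.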