You are asking a question about the expected value of sample information.
First, I suggest that you not define your loss function in terms of variance, at least not without further examination. Instead, identify why you care about error in your estimation procedure. What are the different ways of being wrong? What are the consequences of being wrong in each of those ways? Can you quantify those consequences in terms of units of value that make sense to you?
Then define your loss function as cost-of-testing + E[cost-of-error]. To describe losses in this way, you will need to specify a common unit of measure in which both kinds of loss are denominated. Explicitly or implicitly, your loss function will need to incorporate a specification of the "price" at which you would willingly trade off costs of testing for costs of error. This quantification may seem artificial: it may be that there is no obvious common unit in which both kinds of losses might be measured. Too bad: any loss function will define such terms of trade, either explicitly or implicitly. (In terms of intermediate microeconomics: a level set of the loss function defines an indifference curve in the space of arguments; the price is the slope of this indifference curve.) You are better off making this choice consciously, rather than as the logical consequence of an un-examined ad hoc assumption.
In my opinion (and I am right about this), a problem in statistical decision theory (which this is) is handled most naturally in a Bayesian framework, in which we treat unobserved parameters as random variables. Formally, let $\theta$ denote the true value of the variable whose value you're trying to estimate, e.g., the argument at which your search algorithm would find its true minimum. For any other value $\hat{\theta}$ in your parameter space, let $C(\hat{\theta}, \theta)$ denote the cost of error when $\hat{\theta}$ is used as an estimated value and $\theta$ is the true value. It could be that the cost of error depends only on the size of the difference (somehow defined) between the estimated and true values, e.g. $C(\hat{\theta}, \theta) = (\hat{\theta}- \theta)^2$, where the minus sign designates some measure of distance in parameter space.
Let $f(\theta)$ describe your beliefs about the likelihood that $\theta$ might take on various different values, expressed as a probability distribution over the parameter space. Let $f(\theta \mid x_{1:n})$ describe the posterior probability over $\theta$ that would obtain if your sample consisted of the values $x_{1:n}$. After you've done your sampling, you can compute your expected cost of error as a function of the sample you drew: $E[C \mid x_{1:n}] = \int \int C(\hat{\theta}, \theta) f(\hat{\theta} \mid x_{1:n}) \, d\hat{\theta} \, f(\theta \mid x_{1:n}) \, d\theta$, where both integrals are taken over the parameter space. Your total expected loss is then $E[C \mid x_{1:n}] + cn$.
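As a concrete sketch of this calculation, here is a Monte Carlo estimate of the double integral above for a conjugate normal model with squared-error cost. All of the numbers (prior parameters, noise level, sample size) are hypothetical and chosen purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative conjugate setup (all numbers hypothetical):
# prior  theta ~ N(mu0, tau0^2);  data  x_i | theta ~ N(theta, sigma^2)
mu0, tau0, sigma = 0.0, 2.0, 1.0
theta_true = 1.5
n = 10
x = rng.normal(theta_true, sigma, size=n)

# By conjugacy, the posterior f(theta | x_{1:n}) is N(mu_n, tau_n^2).
tau_n2 = 1.0 / (1.0 / tau0**2 + n / sigma**2)
mu_n = tau_n2 * (mu0 / tau0**2 + x.sum() / sigma**2)

# Expected cost of error E[C | x_{1:n}] for squared-error cost,
# estimated by Monte Carlo exactly as in the double integral:
# draw theta_hat and theta independently from the posterior.
draws = 200_000
theta_hat = rng.normal(mu_n, np.sqrt(tau_n2), size=draws)
theta = rng.normal(mu_n, np.sqrt(tau_n2), size=draws)
mc_cost = np.mean((theta_hat - theta) ** 2)

# For this particular cost the integral has a closed form:
# 2 * posterior variance. The Monte Carlo estimate should agree.
print(mc_cost, 2 * tau_n2)
```

The closed-form check works because $\hat{\theta} - \theta$ is the difference of two independent posterior draws, whose variance is twice the posterior variance.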
If you are allowed to choose $n$ using a stopping rule, rather than specifying a choice in advance, you can employ dynamic programming to choose $n$ optimally. This answer to a previous question should offer useful guidance.
If you have to specify $n$ in advance, then you would employ preposterior analysis. Intuitively, you choose $n$ so that, in expectation, the marginal benefit of the last test just offsets its cost. Formally, let $f(\theta)$ describe your prior beliefs about the distribution of $\theta$, before any sampling has been done. Let $l(x_{1:n} \mid \theta)$ denote the likelihood of drawing a specific sample $x_{1:n}$, if $\theta$ were the true value of the parameter. Then before you've taken your first sample, you can already formulate an expression for your expected cost of error:
$\int \int E[C \mid \mathbf{x}] \, l(\mathbf{x} \mid \theta) \, f(\theta) \, d\mathbf{x} \, d\theta.$
Here, the inner integral is taken over the sample space, and the outer integral is taken over the parameter space. You would choose $n$ to minimize the above expression plus $cn$, the cost of testing.
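To make the preposterior choice of $n$ concrete, here is a sketch for the conjugate normal model with squared-error cost (all numbers hypothetical). In that model the posterior variance does not depend on the observed data, so the preposterior integral collapses to a deterministic function of $n$ and the optimization is a one-line search:

```python
import numpy as np

# Preposterior analysis sketch for a conjugate normal model
# (all numbers hypothetical): prior theta ~ N(mu0, tau0^2),
# x_i | theta ~ N(theta, sigma^2), squared-error cost, per-test cost c.
tau0, sigma, c = 2.0, 1.0, 0.005

def expected_loss(n):
    # Posterior variance after n observations; in this model it does
    # not depend on the data, so the preposterior expectation is exact.
    tau_n2 = 1.0 / (1.0 / tau0**2 + n / sigma**2)
    # Expected cost of error (2 * posterior variance under the
    # squared-error cost) plus the cost of testing, c * n.
    return 2.0 * tau_n2 + c * n

ns = np.arange(0, 201)
losses = np.array([expected_loss(n) for n in ns])
n_star = int(ns[losses.argmin()])
print(n_star)  # → 20 for these hypothetical numbers
```

Note how the optimum sits where the marginal reduction in expected error cost from one more test roughly equals the per-test cost $c$, which is exactly the intuition stated above.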
Choosing $n$ dynamically is better, if possible, because you can equate the marginal benefit of the last test to its cost in every realization, rather than merely in expectation.
You already know a lot. Two observations.
Take linear regression. Minimizing the squared error turns out to be equivalent to maximizing the likelihood when the errors are normally distributed. Loosely, one could say that minimizing the squared error is the intuitive method, while maximizing the likelihood is a more formal approach that supports proofs using, for example, properties of the normal distribution. The two can yield the same estimates.
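A quick numerical sketch of this equivalence (toy data, assumed known noise scale): the Gaussian negative log-likelihood is an affine function of the sum of squared errors, so the least-squares fit also minimizes the negative log-likelihood.

```python
import numpy as np

# Toy regression problem: y = X @ beta_true + Gaussian noise.
rng = np.random.default_rng(1)
X = np.column_stack([np.ones(50), rng.normal(size=50)])
beta_true = np.array([2.0, -1.0])
y = X @ beta_true + rng.normal(scale=0.5, size=50)

# Least-squares estimate.
beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)

def nll(beta, sigma=0.5):
    # Gaussian negative log-likelihood; note it is an affine function
    # of the sum of squared residuals, so its minimizer is the same.
    r = y - X @ beta
    return 0.5 * len(y) * np.log(2 * np.pi * sigma**2) + (r @ r) / (2 * sigma**2)

# The OLS solution attains a lower NLL than random nearby perturbations.
for _ in range(100):
    beta_alt = beta_ols + rng.normal(scale=0.1, size=2)
    assert nll(beta_ols) <= nll(beta_alt)
print(beta_ols)
```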
Second, whether you minimize or maximize is, as far as I know, largely arbitrary: minimizing the negative of a function is the same as maximizing the function itself. Many optimization routines happen to be written in minimization mode; for reasons of parsimony and readability this has become the standard convention.
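The equivalence is trivial to demonstrate on a toy objective:

```python
import numpy as np

# A routine written in "minimization mode" maximizes f simply by
# minimizing -f; the location of the optimum is unchanged.
xs = np.linspace(-3, 3, 601)
f = -(xs - 1.0) ** 2  # toy objective with its maximum at x = 1

x_max = xs[np.argmax(f)]        # direct maximization
x_min_of_neg = xs[np.argmin(-f)]  # minimize the negative instead
print(x_max, x_min_of_neg)      # identical
```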
TL;DR
I recommend using LIPO. It is provably correct and provably better than pure random search (PRS). It is also extremely simple to implement, and has no hyperparameters. I have not conducted an analysis that compares LIPO to BO, but my expectation is that the simplicity and efficiency of LIPO imply that it will outperform BO.
(See also: What are some of the disavantage of bayesian hyper parameter optimization?)
Bayesian Optimization
Bayesian Optimization-type methods build Gaussian process surrogate models to explore the parameter space. The main idea is that parameter tuples that are closer together will have similar function values, so the assumption of a covariance structure among points allows the algorithm to make educated guesses about which parameter tuple is most worthwhile to try next. This strategy helps to reduce the number of function evaluations; in fact, the motivation of BO methods is to keep the number of function evaluations as low as possible while "using the whole buffalo" to make good guesses about what point to test next. There are different figures of merit (expected improvement, expected quantile improvement, probability of improvement...) which are used to compare points to visit next.
Contrast this to something like a grid search, which will never use any information from its previous function evaluations to inform where to go next.
Incidentally, this is also a powerful global optimization technique, and as such makes no assumptions about the convexity of the surface. Additionally, if the function is stochastic (say, evaluations have some inherent random noise), this can be directly accounted for in the GP model.
On the other hand, you'll have to fit at least one GP at every iteration (or several, picking the "best", or averaging over alternatives, or using fully Bayesian methods). Then, the model is used to make (probably thousands of) predictions, usually in the form of multistart local optimization, exploiting the observation that it's much cheaper to evaluate the GP prediction function than the function under optimization. But even with this computational overhead, it tends to be the case that even nonconvex functions can be optimized with a relatively small number of function calls.
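The loop described above (fit a GP, maximize a figure of merit, evaluate the function at the winner, repeat) can be sketched in plain numpy. This is a deliberately minimal 1-D toy, not a production implementation: the kernel length scale and noise level are fixed by hand, and a dense grid stands in for the multistart local optimization; a real BO method would refit these hyperparameters at every iteration.

```python
import numpy as np
from math import erf, sqrt, pi

def rbf(a, b, ell=0.3):
    # RBF (squared-exponential) kernel with a hand-picked length scale.
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ell**2)

def gp_posterior(x_train, y_train, x_query, noise=1e-6):
    # Standard GP regression equations: posterior mean and std dev.
    K = rbf(x_train, x_train) + noise * np.eye(len(x_train))
    Ks = rbf(x_train, x_query)
    mean = Ks.T @ np.linalg.solve(K, y_train)
    var = 1.0 - np.sum(Ks * np.linalg.solve(K, Ks), axis=0)
    return mean, np.sqrt(np.clip(var, 1e-12, None))

def expected_improvement(mean, std, f_best):
    # EI figure of merit for minimization.
    z = (f_best - mean) / std
    Phi = 0.5 * (1.0 + np.array([erf(v / sqrt(2)) for v in z]))
    phi = np.exp(-0.5 * z**2) / sqrt(2 * pi)
    return (f_best - mean) * Phi + std * phi

f = lambda x: np.sin(3 * x) + x**2 - 0.7 * x  # stand-in "expensive" function
grid = np.linspace(-1.0, 2.0, 400)            # stand-in for multistart search

rng = np.random.default_rng(2)
x_train = rng.uniform(-1.0, 2.0, size=4)      # a few initial design points
y_train = f(x_train)
for _ in range(15):
    # BO loop: fit the surrogate, then evaluate f where EI is largest.
    mean, std = gp_posterior(x_train, y_train, grid)
    x_next = grid[np.argmax(expected_improvement(mean, std, y_train.min()))]
    x_train = np.append(x_train, x_next)
    y_train = np.append(y_train, f(x_next))

print(x_train[np.argmin(y_train)], y_train.min())
```

Nineteen function evaluations in total, illustrating the point that the expensive function is called far less often than the (cheap) surrogate.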
A widely-cited paper on the topic is Jones et al (1998), "Efficient Global Optimization of Expensive Black-Box Functions." But there are many variations on this idea.
Random Search
Even when the cost function is expensive to evaluate, random search can still be useful. Random search is dirt-simple to implement. The only choice for a researcher to make is setting the probability $p$ that you want your results to lie in some quantile $q$; the rest proceeds automatically using results from basic probability.
Suppose your quantile is $q = 0.95$ and you want a $p=0.95$ probability that the model results are in the top $100\times (1-q)=5$ percent of all hyperparameter tuples. The probability that all $n$ attempted tuples are not in that window is $q^n = 0.95^n$ (because they are chosen independently at random from the same distribution), so the probability that at least one tuple is in that region is $1 - 0.95^n$. Putting it all together, we have
$$ 1 - q^n \ge p \implies n \ge \frac{\log(1 - p)}{\log(q)} $$
which in our specific case yields $n \ge 59$.
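The calculation above is a one-liner, which makes it easy to re-run for other choices of $p$ and $q$:

```python
from math import ceil, log

# Number of random-search draws needed so that, with probability at
# least p, at least one draw lands in the top (1 - q) fraction.
def n_required(p, q):
    return ceil(log(1 - p) / log(q))

print(n_required(0.95, 0.95))  # → 59, matching the derivation above
print(n_required(0.95, 0.99))  # top 1 percent is much more demanding
```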
This result is why most people recommend $n=60$ attempted tuples for random search. It's worth noting that $n=60$ is comparable to the number of experiments required to get good results with Gaussian Process-based methods when there are a moderate number of parameters. Unlike Gaussian Processes, the number of queried tuples does not change with the number of hyperparameters to search over; indeed, for a large number of hyperparameters, a Gaussian process-based method can take many iterations to make headway.
Since you have a probabilistic characterization of how good the results are, this result can be a persuasive tool to convince your boss that running additional experiments will yield diminishing marginal returns.
LIPO and its Variants
This is an exciting arrival which, if it is not new, is certainly new to me. It proceeds by alternating between placing informed upper bounds on the function and sampling where the best bound is most promising; some variants also mix in quadratic approximations. I'm still working through all the details, but I think this is very promising. This is a nice blog write-up, and the paper is Cédric Malherbe and Nicolas Vayatis, "Global optimization of Lipschitz functions."
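Here is a compact sketch of the adaptive variant (AdaLIPO) for maximizing a function of one variable, following my reading of the Malherbe and Vayatis paper: estimate the Lipschitz constant from the evaluations seen so far, and only spend an evaluation on a candidate whose Lipschitz upper bound could beat the current best. The blend with quadratic/trust-region approximations used in some implementations is omitted for brevity, and the exploration probability is a hand-picked assumption.

```python
import numpy as np

def adalipo_max(f, lo, hi, n_evals=60, explore=0.1, seed=3):
    rng = np.random.default_rng(seed)
    xs = [rng.uniform(lo, hi)]
    ys = [f(xs[0])]
    for _ in range(n_evals - 1):
        # Estimate the Lipschitz constant from all pairs seen so far
        # (the eye() term only pads the diagonal to avoid 0/0).
        X, Y = np.array(xs), np.array(ys)
        d = np.abs(X[:, None] - X[None, :]) + np.eye(len(X))
        k_hat = np.max(np.abs(Y[:, None] - Y[None, :]) / d)
        while True:
            x = rng.uniform(lo, hi)
            # Explore occasionally; otherwise accept x only if its
            # upper bound min_i f(x_i) + k_hat * |x - x_i| could
            # improve on the best value found so far.
            if rng.random() < explore:
                break
            if np.min(Y + k_hat * np.abs(X - x)) >= Y.max():
                break
        xs.append(x)
        ys.append(f(x))
    best = int(np.argmax(ys))
    return xs[best], ys[best]

# Toy objective with its maximum at x = 0.7.
x_best, y_best = adalipo_max(lambda x: -(x - 0.7) ** 2, 0.0, 1.0)
print(x_best, y_best)
```

Note that candidates are free to draw (only rejected candidates cost nothing); the budget is spent exclusively on evaluations of $f$, which is the whole point when $f$ is expensive.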