1)
First-order methods have many variants, for example using conjugate directions instead of the steepest-descent direction (the conjugate gradient method).
There is also a multitude of "line search" algorithms for computing the step length in first-order methods. These include bracketing methods (e.g. golden-section search) as well as Newton and quasi-Newton methods, where a second-order method is used to compute the step length inside a first-order method. This may seem odd, but the point is that the first-order method works in many dimensions, while the second-order line search is applied only to a univariate objective: the restriction of the cost function to the line along the step direction.
In theory, you can also use numerical derivatives, i.e. approximate them from cost-function evaluations. This lets you use first-order methods without an analytical expression for the derivatives. Methods for computing derivatives numerically include finite-difference approximation, Neville's method, Ridder's method, and automatic differentiation.
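As a minimal sketch of the finite-difference idea, here is a central-difference gradient approximation, assuming a smooth scalar cost function `f` and a hand-picked step size `h` (the names are my own):

```python
def numerical_gradient(f, x, h=1e-6):
    """Approximate the gradient of f at point x by central differences."""
    grad = []
    for i in range(len(x)):
        xp = list(x); xp[i] += h   # step forward in coordinate i
        xm = list(x); xm[i] -= h   # step backward in coordinate i
        grad.append((f(xp) - f(xm)) / (2 * h))
    return grad
```

Note that each gradient evaluation costs $2n$ cost-function calls, which is why this only pays off when analytical derivatives are truly unavailable.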
Second-order methods like Gauss-Newton and Levenberg-Marquardt do not use the explicit Hessian; they approximate it from the Jacobian $J$ of the residuals:
$$H \approx J^{\mathsf{T}}J$$
The full justification for this approximation (which requires no explicit second-order derivatives) is outside the scope of this answer, but such an approximate Hessian tends to be more numerically stable, because the second-order terms it omits are noise-sensitive.
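A sketch of a single Gauss-Newton step using this approximation, assuming user-supplied `residuals` and `jacobian` callables (an illustration, not a production implementation):

```python
import numpy as np

def gauss_newton_step(x, residuals, jacobian):
    """One Gauss-Newton step for minimizing 0.5 * ||residuals(x)||^2."""
    r = residuals(x)   # residual vector at x
    J = jacobian(x)    # Jacobian of the residuals at x
    H = J.T @ J        # approximate Hessian -- no second derivatives needed
    g = J.T @ r        # gradient of 0.5 * ||r||^2
    return x - np.linalg.solve(H, g)
```

For residuals that are linear in $x$, a single step lands exactly on the least-squares solution; for nonlinear residuals the step is iterated.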
2)
There are many methods for derivative-free optimization, and some are designed for large and sparse problems like the black-box cost function you have. Such methods include model-based methods, coordinate- and pattern-search methods, the conjugate-direction method, the Nelder-Mead method, and implicit filtering.
Maybe a careful design of your cost function will allow you to omit the unknown data without harming the result too much.
The following books helped me greatly and both go into detail when it comes to high-fidelity optimization (large and sparse datasets, unknown derivatives):
Jorge Nocedal, Stephen J. Wright: "Numerical Optimization, Second Edition"
Press, Teukolsky, Vetterling, Flannery: "Numerical Recipes, Third Edition"
You can apply the algorithm in Ilmari's answer recursively. Sort along some direction $x$, and for each value of $x$, beginning with the best, add the $(n-1)$-dimensional Pareto frontier (computed recursively) of the items with that value of $x$ to the $n$-dimensional Pareto frontier, then ignore or remove all items dominated by the added items.
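A sketch of this recursion in Python (my own illustration; items are tuples of scores, higher is better in every coordinate):

```python
from itertools import groupby

def dominates(a, b):
    """a dominates b: at least as good everywhere, strictly better somewhere."""
    return a != b and all(x >= y for x, y in zip(a, b))

def pareto_frontier(items):
    """Recursive Pareto frontier for n-dimensional score tuples."""
    if not items:
        return []
    if len(items[0]) == 1:
        return [max(items)]  # base case: one dimension, keep only the best
    frontier = []
    # Sort along the first coordinate, best value first, and recurse per value.
    for x, group in groupby(sorted(items, reverse=True), key=lambda t: t[0]):
        for rest in pareto_frontier([t[1:] for t in group]):
            c = (x,) + rest
            # Keep c only if no previously added item dominates it; items added
            # earlier have a better-or-equal first coordinate, so c can never
            # dominate them in turn.
            if not any(dominates(f, c) for f in frontier):
                frontier.append(c)
    return frontier
```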
Best Answer
The basic definition of the Pareto frontier is that it consists of exactly those alternatives that are not dominated by any other alternative. We say that an alternative $A$ dominates $B$ if $A$ outscores $B$ regardless of the tradeoff between value and cost — that is, if $A$ is both better and cheaper than $B$.
Obviously, both the best and the cheapest alternative always belong to the Pareto frontier, and in fact, they are its endpoints. A simple algorithm to find the other alternatives (if any) on the Pareto frontier is to first sort the alternatives according to one of the objectives — say, cost. One then starts with the cheapest alternative (which, as noted, always belongs in the Pareto frontier) and skips successive alternatives in order of increasing cost until one finds one with a higher value. This alternative is then added to the frontier and the search is restarted from it.
A step-by-step description of the algorithm, assuming that $A_1, \dotsc, A_n$ are the alternatives in increasing order of cost, goes like this:
Algorithm A:
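In Python, the sweep just described might look like this (a minimal sketch, assuming `(cost, value)` pairs with higher value better):

```python
def pareto_frontier_2d(alternatives):
    """Sort by cost, then keep each alternative whose value beats the best so far."""
    frontier = []
    best_value = float("-inf")
    for cost, value in sorted(alternatives):  # increasing cost
        if value > best_value:
            frontier.append((cost, value))
            best_value = value
    return frontier
```

The sort dominates the running time; once the alternatives are sorted, the sweep itself is a single linear pass.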
Addendum
In the comments below, Nupul asked about computing the Pareto frontier for combinations of alternatives.
There are two natural and interesting cases: one where the combinations are sets of alternatives (i.e. each alternative may appear at most once in a combination), and one where they are multisets (which can include the same alternative more than once). I'll address both cases below, but let me first give an alternative algorithm for computing the Pareto frontier for single alternatives:
Algorithm B:
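One plausible reconstruction of this incremental variant in Python (a sketch only: I keep $P$ in a plain sorted list and assume distinct costs; a balanced search tree would be needed to actually achieve the stated $O(n \log n)$ bound):

```python
import bisect

def pareto_frontier_2d_incremental(alternatives):
    """Maintain a frontier P, sorted by cost, while scanning the alternatives."""
    P = []  # invariant: sorted by cost, with value increasing along with cost
    for a in alternatives:
        i = bisect.bisect_left(P, a)
        # The most expensive alternative in P cheaper than a is P[i-1];
        # if it has at least a's value, it dominates a (see below).
        if i > 0 and P[i - 1][1] >= a[1]:
            continue
        # Evict successors that a dominates (costlier, no more valuable).
        j = i
        while j < len(P) and P[j][1] <= a[1]:
            j += 1
        P[i:j] = [a]
    return P
```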
In step 3, to check whether $A_j$ is dominated by any alternative in $P$, it will suffice to find the most expensive alternative $B \in P$ which is still cheaper than $A_j$. (Of course, by symmetry, we could also choose $B$ as the least valuable alternative in $P$ that has higher value than $A_j$.) If $B$ does not dominate $A_j$, neither can any other alternative in $P$.
Algorithm B is somewhat more complicated than algorithm A above (in particular, algorithm A runs in $O(n)$ time if the alternatives are already sorted, whereas algorithm B needs $O(n \log n)$ time in any case), but it turns out to generalize better.
Now, let's consider sets of zero or more alternatives. Obviously, we could just enumerate all such sets and then apply algorithm A or B, but this would be very inefficient: for $n$ alternatives, the number of sets is $2^n$. Instead, we can adapt algorithm B to construct the combinations on the fly:
Algorithm C:
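My own reading of this on-the-fly construction, as a small dynamic program over the alternatives (a sketch; the pruning relies on the fact that, with additive costs and values, a dominated partial combination stays dominated after being extended by the same remaining alternatives):

```python
def prune(combos):
    """Keep only non-dominated (cost, value) pairs, sorted by cost."""
    frontier, best = [], float("-inf")
    for cost, value in sorted(combos):
        if value > best:
            frontier.append((cost, value))
            best = value
    return frontier

def set_pareto_frontier(alternatives):
    """Pareto frontier over subsets of alternatives, without enumerating 2^n sets."""
    P = [(0, 0)]  # start from the empty combination
    for c, v in alternatives:
        # Either skip the new alternative, or add it to each frontier combination.
        P = prune(P + [(pc + c, pv + v) for pc, pv in P])
    return P
```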
Again, as in algorithm B, we don't need to compare $C^*$ to every combination in $P$; it's enough to check whether $C^*$ is dominated by the most expensive combination in $P$ that is cheaper than $C^*$.
What about multiset combinations? In this case, obviously, the Pareto frontier $P$ contains infinitely many combinations: in particular, for any combination $C \in P$, the combination $C + A_1$, where $A_1$ is the alternative with the lowest cost/value ratio, also belongs in $P$.
However, the number of times any other alternative can appear in a non-dominated combination must be bounded. (The proof is a bit tedious, but follows from simple geometrical considerations.) Therefore we only need to consider the finite set $P_0$ of those combinations on the Pareto frontier which do not include the alternative $A_1$; the remaining combinations on the frontier are obtained by adding some number of $A_1$s to those combinations.
For multiset combinations, we also have the following useful lemma:
Lemma: Any combination that contains a dominated sub-combination must itself be dominated.
In particular, this means that combinations in $P$ can only include alternatives that themselves belong in $P$. Thus, as a first step, we can use algorithm A (or B) to compute the Pareto frontier for single alternatives and discard any alternatives that are not part of it.
For a complete algorithm to compute $P_0$, the following definition turns out to be useful:
Dfn: A combination $B$ is said to $A$-dominate $C$ if the combination $B + nA$ dominates $C$ for some non-negative integer $n \in \mathbb N$. Equivalently, $B$ $A$-dominates $C$ iff $\operatorname{cost}(C) > \operatorname{cost}(B) + n\,\operatorname{cost}(A)$, where $n = \max \left(0, \left\lceil \frac{\operatorname{value}(C)-\operatorname{value}(B)}{\operatorname{value}(A)} \right\rceil \right)$ is the smallest number of copies of $A$ that brings the value of $B + nA$ up to at least that of $C$.
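The definition translates directly into a test (a sketch with `(cost, value)` tuples; I take the smallest $n$ that makes $B + nA$ at least as valuable as $C$, hence the ceiling, and assume $\operatorname{value}(A) > 0$):

```python
import math

def a_dominates(B, C, A):
    """True if B + n*A dominates C for some non-negative integer n."""
    # Cheapest n meeting the value bar; larger n only makes B + n*A costlier.
    n = max(0, math.ceil((C[1] - B[1]) / A[1]))
    return C[0] > B[0] + n * A[0]
```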
Using this definition, we can apply the following algorithm:
Algorithm D:
I think this algorithm should be correct, but to be honest, I'm not 100% sure that there aren't any mistakes or gaps in my proof sketch. So please test it thoroughly before using it for anything that matters. :-)
I also think it should be possible to implement step 2 efficiently in a manner similar to algorithms B and C, but the situation is complicated by the fact that the rejection condition is $A_1$-domination rather than simple domination. Of course, if $B$ dominates $C$, then it trivially $A$-dominates $C$ for any $A$, so one can at least first do a quick check for simple dominance.
Another way to optimize step 2 is to note that if $C + A_i$ is $A_1$-dominated by any combination in $P_0$, then so will be $D + A_i$ for any combination $D$ that includes $C$ as a sub-combination. In particular, we can skip to step 4 as soon as we find that $\emptyset + A_i = A_i$ itself is $A_1$-dominated by some combination already in $P_0$.