First, most mathematicians don't really care whether all sets are "pure" -- i.e., only contain sets as elements -- or not. The theoretical justification for this is that, assuming the Axiom of Choice, every set can be put in bijection with a pure set -- namely a von Neumann ordinal.
I would describe Bourbaki's approach as "structuralist", meaning that all structure is based on sets (I wouldn't take this as a philosophical position; it's the most familiar and possibly the simplest way to set things up), but it is never fruitful to inquire as to what kind of objects the sets contain. I view this as perhaps the key point of "abstract" mathematics in the sense that the term has been used for the past century or so. E.g. an abstract group is a set with a binary operation: part of what "abstract" means is that it won't help you to ask whether the elements of the group are numbers, or sets, or people, or what.
I say this without having ever read Bourbaki's volumes on Set Theory, and I claim that this somehow strengthens my position!
Namely, Bourbaki is relentlessly linear in its exposition, across thousands of pages: if you want to read about the completion of a local ring (in Commutative Algebra), you had better know about Cauchy filters on a uniform space (in General Topology). In places I feel that Bourbaki overemphasizes logical dependencies and therefore makes strange expository choices: e.g. they don't want to talk about metric spaces until they have "rigorously defined" the real numbers, and they don't want to do that until they have the theory of completion of a uniform space. This is unduly fastidious: certainly by 1900 people knew any number of ways to rigorously construct the real numbers that did not require 300 pages of preliminaries.
However, I have never in my reading of Bourbaki (I've flipped through about five of their books) been stymied by a reference back to some previous set-theoretic construction. I also learned only late in the day that the "structures" they speak of actually get a formal definition somewhere in the early volumes: again, I didn't know this because whatever "structure-preserving maps" they were talking about were always clear from the context.
Some have argued that Bourbaki's true inclinations were closer to a proto-categorical take on things. (One must remember that Bourbaki began in the 1930's, before category theory existed, and their treatment of mathematics is consciously "conservative": it's not their intention to introduce you to the latest fads.) In particular, apparently among the many unfinished books of Bourbaki lying on the shelf somewhere in Paris is one on Category Theory, written mostly by Grothendieck. The lack of explicit mention of the simplest categorical concepts is one of the things which makes their work look dated to modern eyes.
I am very much used to this kind of question. Are 2-categories useful? What can one prove using gerbes? Why should I care about stacks?
I think a funny way to react to this kind of question, with often surprising results, is to return the question:
If you want to know what higher X is good for, explain first what X is good for, in your opinion.
And whatever the person answers, I have mostly found it very easy to generalize the given argument from X to higher X.
Example 1 If X is "category", a common answer is "it keeps track of the automorphisms of the objects". Well, a 2-category keeps track of the automorphisms of automorphisms.
Example 2 The question was: "What can you prove with gerbes?", so I'll reply: "What can you prove with bundles?". People are often completely puzzled by this question, so they'll accept that a notion may be useful even if it's not there to prove something.
Best Answer
I agree 100% with Igor and Andrew L., on the benefit of reading the creator's version of the same thing available from later expositors. I have gained mathematical insights from reading Euclid, Archimedes, Riemann, Gauss, Hurwitz, Wirtinger, as well as moderns like Zariski.... on topics I already thought I understood.
Just Euclid's use of the word "measures" for "divides" finally made clear to me the elementary argument that the largest number dividing 2 integers is also the smallest positive number one can measure using both of them. This is clear when one thinks of (commensurable) measuring sticks, since by translating it is obvious that the lengths one can so measure are equally spaced, hence the smallest one would measure them all.
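The measuring-stick statement can be checked by brute force. The following is only a minimal sketch: the function name `smallest_positive_measure` and the `search` bound are hypothetical, and the bound is assumed large enough to reach the smallest length for the inputs used.

```python
from math import gcd


def smallest_positive_measure(a, b, search=50):
    """Smallest positive length m*a + n*b with integers |m|, |n| <= search.

    Laying a stick of length a forwards or backwards m times, and a stick of
    length b forwards or backwards n times, measures the length m*a + n*b.
    """
    lengths = {
        m * a + n * b
        for m in range(-search, search + 1)
        for n in range(-search, search + 1)
    }
    return min(length for length in lengths if length > 0)


# The smallest measurable length agrees with the largest common divisor.
print(smallest_positive_measure(12, 18), gcd(12, 18))
print(smallest_positive_measure(35, 21), gcd(35, 21))
```

The exhaustive search is deliberately naive; it mirrors the "equally spaced lengths" picture rather than the Euclidean algorithm.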
I was unaware also that Euclid's characterization of a tangent line to a circle was not just that it is perpendicular to the radius, but is the only line meeting the circle locally once and such that changing its angle ever so little produces a second intersection, i.e. Newton's definition of a tangent line. It is said Newton read Euclid just before giving his own definition.
I did not realize until reading Archimedes that the "Cavalieri principle" follows just from the definition of the Riemann integral, without needing the fundamental theorem of calculus. I.e. it follows just from the definition of a volume as a limit of approximating slices, and was known to Archimedes. Hence one can conclude all the usual volume formulas for pyramids, cones, spheres, even the bicylinder, just by starting from the decomposition of a cube into three right pyramids, applying Cavalieri to vary the angle of the pyramid, then approximating and using Cavalieri. It is an embarrassment to me that I had thought the volume of a bicylinder a more difficult calculus problem than that for a sphere, when it follows immediately from comparing horizontal slices of a double square-based pyramid inscribed in a cube. I.e. by Cavalieri and the Pythagorean theorem, the volume of a sphere is the difference between the volumes of a cylinder and an inscribed double cone. The same argument shows the volume of a bicylinder is the difference between the volumes of a cube and an inscribed double square-based pyramid. This led to an intuitive understanding of the simple relation between the volumes of certain inscribed figures that I then noticed had been recently studied by Tom Apostol.
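As a sketch of the slice comparison, in notation assumed here (radius $r$, height $h$ of a horizontal slice above the center, $-r \le h \le r$): the slice of the sphere is a disc of radius $\sqrt{r^2-h^2}$ and the slice of the bicylinder is a square of side $2\sqrt{r^2-h^2}$, so

$$\pi(r^2-h^2) = \pi r^2 - \pi h^2, \qquad 4(r^2-h^2) = (2r)^2 - (2h)^2.$$

The right-hand sides are the slices of the cylinder minus the double cone, and of the cube minus the double pyramid, respectively. Cavalieri then gives

$$V_{\text{sphere}} = 2\pi r^3 - \tfrac{2}{3}\pi r^3 = \tfrac{4}{3}\pi r^3, \qquad V_{\text{bicylinder}} = 8r^3 - \tfrac{8}{3}r^3 = \tfrac{16}{3}r^3.$$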
I realized this summer that this allows a computation of the volume of the 4 dimensional ball. I.e. this ball results from revolving half a 3 ball, hence can be calculated by revolving a cylinder and subtracting the volume of revolving a cone. Since Archimedes knew the center of gravity of both those solids he knew this.
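A sketch of that computation, under assumptions stated here rather than in the text above: the half ball of radius $r$ rests on its flat face, the cylinder has radius $r$ and height $r$, the cone has its vertex at the center of the flat face and its base on the top of the cylinder, and Pappus's centroid rule is assumed to apply to revolving a solid about the plane of the flat face in 4-space. Slices at height $h$ have equal area, $\pi(r^2-h^2) = \pi r^2 - \pi h^2$, and lie at the same distance $h$ from that plane, so the revolved volumes also subtract. The centroid of the cylinder is at height $r/2$ and that of the cone at height $3r/4$, hence

$$\operatorname{vol}(B^4_r) = 2\pi\cdot\frac{r}{2}\cdot\pi r^3 \;-\; 2\pi\cdot\frac{3r}{4}\cdot\frac{\pi r^3}{3} = \pi^2 r^4 - \frac{\pi^2 r^4}{2} = \frac{\pi^2 r^4}{2}.$$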
Having read everywhere that Hurwitz's theorem was that the maximum number of automorphisms of a Riemann surface of genus $g$ is $84(g-1),$ I had a difficult proof that the maximum number in genus $5$ is $192,$ using Jacobians, Prym varieties, and classifications of representations of planar groups, until Macbeath referred me to Hurwitz's original paper where a complete list of the possible orders was easily given: $84(g-1), 48(g-1),\ldots$ I subsequently explained this easy argument to some famous mathematical figures. Sometime later a more complicated such example for which Macbeath himself was usually credited was found also to occur in the 19th-century literature.
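The start of such a list can be reproduced from the Riemann-Hurwitz formula. This is a sketch under assumptions, not a reconstruction of Hurwitz's paper: it only treats quotients of genus $0$ with three branch points of orders $a \le b \le c$, where $2g-2 = |G|\,(1 - 1/a - 1/b - 1/c)$, and it only lists candidate coefficients without checking that a group of that order actually occurs. The function name and the `max_period` cutoff are hypothetical.

```python
from fractions import Fraction
from itertools import combinations_with_replacement


def hurwitz_coefficients(max_period=12):
    """Candidate values k with |G| = k*(g-1), from triples (a, b, c).

    Only genus-0 quotients with three branch points are considered, and
    branch orders are capped at max_period (an illustrative cutoff).
    """
    coefficients = set()
    for a, b, c in combinations_with_replacement(range(2, max_period + 1), 3):
        defect = 1 - Fraction(1, a) - Fraction(1, b) - Fraction(1, c)
        if defect > 0:  # hyperbolic case only
            coefficients.add(2 / defect)
    return sorted(coefficients, reverse=True)


coefficients = hurwitz_coefficients()
print([str(k) for k in coefficients[:5]])
# Second candidate coefficient evaluated at genus 5.
print(coefficients[1] * (5 - 1))
```

The largest candidates come from the triples $(2,3,7)$ and $(2,3,8)$, which is where $84(g-1)$ and $48(g-1)$ originate; whether a given candidate is attained in a given genus is a separate question that this enumeration does not address.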
Having studied Riemann surfaces all my life, but unable to read German well, I thought I had acquired some grasp of the Riemann Roch theorem, in particular I thought Riemann had given only an inequality $\ell(D) \geq 1-g + \deg(D).$ When the translation from Kendrick Press became available, I learned he had written down a linear map whose kernel computed $\ell(D),$ and the estimate derived from the fundamental theorem of linear algebra. The full equality also follows, but only if one can compute the cokernel as well. That cokernel of course was already shown by him to be what we now call $H^1(D).$ Hence Riemann's original theorem was the so-called "index" version of RR. Since he expressed his map in terms of path integrals, it was natural to evaluate those integrals by residue calculus as Roch did. This is explained in my answer to "why is Riemann Roch [not precisely] an index problem?" Although there are many fine modern expositions of Riemann Roch, the most insightful perhaps being that in the chapter on Riemann surfaces in Griffiths and Harris, I had not seen how simple it was until reading Riemann.
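One way to read the index statement, as a sketch: the symbols $T$, $V$, $W$ are notation assumed here, not Riemann's, and the precise spaces he used are not reproduced. For any linear map $T\colon V \to W$ of finite-dimensional spaces, rank-nullity gives

$$\dim\ker T - \dim\operatorname{coker} T = \dim V - \dim W.$$

If the map is set up so that $\dim\ker T = \ell(D)$ and $\dim V - \dim W = 1 - g + \deg(D),$ then dropping the cokernel term yields the inequality $\ell(D) \geq 1-g+\deg(D),$ while identifying the cokernel with $H^1(D)$ yields

$$\ell(D) - \dim H^1(D) = 1 - g + \deg(D).$$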
Perhaps this is only historical knowledge, but reading Riemann one sees that he also knew completely how to prove (index) Riemann Roch for algebraic plane curves, without appealing to the questionable Dirichlet principle, hence the usual impression that a rigorous proof had to await later arguments of Clebsch, Hilbert, or Brill and Noether, is incorrect.
Reading Wirtinger’s 19th century paper on theta functions, even though unfortunately for me only available in the original German, I learned that when a smooth Riemann surface acquires a singularity, the elementary holomorphic differential with a nonzero period around that vanishing cycle becomes meromorphic, and that period becomes the residue at the singular point. At last this explains clearly why one defines "dualizing differentials" as one does in algebraic geometry.
Once, as a grad student in Auslander's algebraic geometry class, I vowed to try out Abel's advice and read the master Zariski's paper on the concept of a simple point. I was very discouraged when several hours passed and I had managed only a few pages. Upon returning to class, Auslander began to pepper us with questions about regular local rings. I found out how much I had learned when I answered them all easily until he literally told me to be quiet, since I obviously knew the subject cold. (To be honest, I did not know the very next question he posed, but I was off the hook.)
In my answer to a question about where to learn sheaf cohomology I have given an example of insight only contained in Serre's original paper.
The sense of wonder and awe one gets upon reading people like Riemann or Euler, is also quite wonderful. Any student who has struggled to compute the sum of the even powers of the reciprocals of natural numbers $1/n^{2k},$ will be amazed at Euler's facile accomplishment of this for many values of $k.$ Calculus students estimating $\pi$ by the usual series to 3 or 4 places will also be impressed at his scores of correct digits. On the other hand, anyone using a modern computer can detect an actual error in his expansion of $\pi,$ I forget where, in the 214th place? but an error which was already noticed long ago.
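For orientation, the first few of these sums, together with the general closed form in terms of the Bernoulli numbers $B_{2k}$ (standard notation assumed here, not taken from the text above):

$$\sum_{n\ge 1}\frac{1}{n^2} = \frac{\pi^2}{6}, \qquad \sum_{n\ge 1}\frac{1}{n^4} = \frac{\pi^4}{90}, \qquad \sum_{n\ge 1}\frac{1}{n^6} = \frac{\pi^6}{945}, \qquad \sum_{n\ge 1}\frac{1}{n^{2k}} = \frac{(-1)^{k+1} B_{2k}\,(2\pi)^{2k}}{2\,(2k)!}.$$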
As you can see these are elementary examples hence from a fairly naive and uneducated person, myself, who has not at all plumbed the depth of many original papers. But these few forays have definitely convinced me there is a benefit that cannot be gained elsewhere, as these exposures can transform the understanding of ordinary mortals closer to that of more knowledgeable persons, at least in a narrow vein. So while it might be thought that only the strongest mathematicians can attempt these papers, my advice would be that reading such masters may be even more helpful to us average students.
As a remark on criterion 2 of the original question, I find it is not at all necessary to read all of a paper by a master to get some insight. One word in Euclid enlightened me, and before the translation came out, I had already gained most of my understanding of Riemann's argument for RR just from reading the headings of the paragraphs. I learned a proof of RR for plane curves from reading only the introduction to a paper of Fulton. A single sentence of Archimedes, that a sphere is a cone with vertex at the center and base equal to the surface, makes it clear the volume is $1/3$ the radius times the surface area. Moreover this shows the same ratio holds for a bicylinder, whereas the surface area of a bicylinder is considered so difficult we do not even ask it of calculus students. So one should not be discouraged by the difficulty of reading all of a master's paper, although of course it wouldn't hurt.
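Spelled out as a sketch, with $r$ the radius, $S$ the surface area and $V$ the volume (notation assumed here), the cone picture gives $V = \tfrac13\, r\, S.$ For the sphere this reads

$$\tfrac{4}{3}\pi r^3 = \tfrac13\, r\cdot 4\pi r^2,$$

and for the bicylinder, taking its volume to be $\tfrac{16}{3}r^3,$ the same relation gives

$$S_{\text{bicylinder}} = \frac{3V}{r} = \frac{3}{r}\cdot\frac{16}{3}r^3 = 16\,r^2.$$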
A remark on the definition of master, versus creator. There are cases where a later master re-examines an earlier work and adds to it, and in these cases it seems valuable to read both versions. In addition to examples given above of Newton generalizing Euclid and Mumford using Hilbert, perhaps Mumford's demonstration of the power of Grothendieck's Riemann Roch theorem in calculating invariants of the moduli space of curves is relevant.
A related question occurs in many cases since the classical arguments of the "ancients" are preserved but only in classical texts such as Van der Waerden in algebra, and newer books have found slicker methods to avoid them. E.g. the method of Lagrange resolvents is useful in Galois theory for proving an extension of prime degree in characteristic zero is radical. There are faster, less precise methods of showing this such as Artin/Dedekind's method of independence of characters, but the older method is useful when trying to use Galois theory to actually write down solution formulas of cubics and quartics. Thus today we often have an intermediate choice of reading modern expositions which reproduce the methods of the creators, or ones that avoid them, sometimes losing information. (This is discussed in the math 844-2 algebra notes on my web page, where, being a novice, I give all competing methods of proof.)
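To illustrate how resolvents produce an explicit formula, here is a minimal numerical sketch for the depressed cubic $x^3 + px + q = 0.$ It is an illustration under assumptions, not the treatment in the notes mentioned above: the function name is hypothetical, and with $\omega$ a primitive cube root of unity the resolvents $u = x_1 + \omega x_2 + \omega^2 x_3$ and $v = x_1 + \omega^2 x_2 + \omega x_3$ are used through the identities $uv = -3p$ and $u^3 + v^3 = -27q.$

```python
import cmath


def depressed_cubic_roots(p, q):
    """Roots of x^3 + p*x + q = 0 via Lagrange resolvents (illustrative helper)."""
    omega = cmath.exp(2j * cmath.pi / 3)  # primitive cube root of unity
    # u^3 and v^3 are the two roots of t^2 + 27*q*t - 27*p^3 = 0.
    disc = cmath.sqrt((27 * q) ** 2 + 108 * p ** 3)
    candidates = [(-27 * q + disc) / 2, (-27 * q - disc) / 2]
    u_cubed = max(candidates, key=abs)  # avoid dividing by a zero resolvent
    if u_cubed == 0:
        return [0j, 0j, 0j]  # p = q = 0: triple root at zero
    u = u_cubed ** (1 / 3)  # any cube root works; v is then forced
    v = -3 * p / u  # enforces u*v = -3p
    # Invert the resolvents, using x1 + x2 + x3 = 0.
    return [
        (u + v) / 3,
        (omega ** 2 * u + omega * v) / 3,
        (omega * u + omega ** 2 * v) / 3,
    ]


# x^3 - 7x + 6 = (x - 1)(x - 2)(x + 3)
for root in depressed_cubic_roots(-7, 6):
    print(root)
```

The point of the design is that the cube roots taken are of quantities lying in a quadratic extension of the coefficient field, which is exactly the radical tower the resolvent argument describes; the floating-point output is only approximate.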