History, unfortunately, is more complicated than your choices permit.
Presumably, Zermelo had proposed his 1908 axioms in support of his well ordering theorem. In so far as his axioms might relate to intuitive notions of set, his original axioms admitted urelements and relied on singletons to substantiate the truth of equality statements. The idea that a theory of pure sets suffice for mathematics is attributed to Skolem as far as I know. With a little further consideration, this may be seen as assuming the arithmetization of mathematics. As such, one must recognize the chicken and egg problem that arises if linguistic symbols that are not pure sets become conflated with some Goedellian encoding for the sake of claiming that everything is a set.
Zermelo's reliance on singletons reflects the relationship of his system with Frege's logic. Category theorists claim that Zermelo had to make s choice between differing views expressed by Frege and Cantor. Since the relevant text is not translated into English, I cannot judge this claim. Nevertheless, Frege speaks of object identity and concept identity in a circular fashion that arises from Leibniz' principle of the identity of indiscernibles. Zermelo's use of singletons ought to be understood in this context. Foundationalism in the sense of Russell's logical atomism stipulates that object identity is fundamental. Hence, one has "first-order logic" with the proviso that "identity" is a "second-order" concept.
Since modern accounts of Zermelo-Fraenkel set theory are now based upon the first-order paradigm and the presumption that a theory of pure sets suffice for mathematics, urelements are treated in a derivative theory often called ZFA where the 'A' stands for (logical) atoms. The theory assumes that the urelements form a set (in conformance with the "everything is a set" presupposition). And, it is an important theory because of the models that arise from permuting the urelements.
It is, however, entirely plausible to understand urelements with respect to an antichain relation in a different signature.
Among other things to be found in Russell's writings, is the question of how a set is comparable to an object. While no one would ask that particular question today, the fact that urelements are not sets means that an antichain whose denial yields an order based on membership is entirely conceivable.
Such a theory would permit one to assert how non-sets satisfy the real number axioms, for example.
Understanding the distinction between urelements and sets in this manner permits one to understand how Cohen's forcing can extend a given model with "ideal subsets." The axiom of extension is contingent upon the denial of an antichain.
But, since such a view is not of interest to established mathematicians, students of set theory need to focus on the affirmative claims of first-order logic and set theory without concern for "intuition."
With regard to your second question, let me suggest that you look at the papers by Augustus de Morgan found in Ewald's two-volume anthology, "From Kant to Hilbert. " This will provide you with a good foundation for understanding what the assertion, "mathematics is formal," really means.
The received view corresponding with first-order model theory reduces mathematics to language signatures, stipulation of relations and operations involving the symbols of such signatures, and a specification of the arities for each symbol in the signature. These constructs are often called "similarity types."
Of course, the "first-order" designation goes back to the foundationalist stipulation that object identity is fundamental. So, language signatures will contain a symbol denoting an extensional domain of discourse. Extension is important to mathematical practice so that linguistic expressions intended to have "the same meaning" denote "singularly."
It had been Skolem's criticism of Zermelo's theory that gave impetus to this use of language signatures in mathematics over the notion that "mathematics is extensional." This particular claim arises from a different history. It is, of course, important for model theory. But, in the model theory of set theory it yields the "unfolding" and "geology" of possible models studied by set theorists using Cohen's forcing. What Skolem observed about Zermelo's axioms is that they could not be "categorical" in the sense of specifying a single, unique interpretation of the language symbols.
Let me address the set theoretic part of your question.
First of all, the term "universe" is a bit loaded. It can mean either (1) just a mathematical universe, understood as a universe of sets, i.e. a model of set theory; or (2) a Grothendieck universe, specifically, especially in the context of category theory, which is a model of second-order set theory, which means that it is isomorphic with $V_\kappa$ for an inaccessible cardinal $\kappa$, if you're unfamiliar with these notions, the key point to take forward is that this is a very special model of set theory which agrees with its metatheory on things like power sets, etc.
So right off the bat, we get that in the specific interpretation as a Grothendieck universe, it cannot be that every model of set theory is a universe in a larger model of set theory. But let's focus on the first meaning, simply "a model" (which raises the obvious question: why not just refer to it as a model?)
Nevertheless, these things are indeed studied, mostly in philosophy of set theory. You can read about the multiverse axioms, specifically in Hamkins' approach to the multiverse. Hamkins also has work on set theoretic potentialism, which you might want to read about.
Overall, if there is a proper class of inaccessible cardinals, you immediately get an ever growing hierarchy of universes that are even Grothendieck universes as well. This can be made stronger (e.g. require that "$\mathrm{Ord}$ is Mahlo") or weaker (e.g. require that simply every set is a transitive model of $\sf ZFC$", or even that there is a proper class of cardinals $\alpha$, such that $V_\alpha$ is a model of $\sf ZFC$; both of which are weaker than even a single inaccessible cardinal as far as consistency strength goes).
If my understanding of Grothendieck universes is correct, we actually do use them as you suggest. To make something "small", you must first envelope it by a larger universe. And since both universes agree on things like power sets, we get exactly what you're asking for.
One last thing worth pointing out, though, is that we can easily formalise all of the theories you mentioned, and more, in something like Peano arithmetic, or even Primitive Recursive Arithmetic, where there is no innate notion of "collection", rather this is meaning we assign—as people, as the readers and executors of mathematics—to the objects of set theory.
Best Answer
We mean exactly the same thing as we mean when we say $\mathbb{N}$ (the set of natural numbers) is a set or $\mathbb{R}$ (the set of real numbers) is a set, or a group is a set equipped with a binary relation satisfying some axioms, etc. etc. The definition of "group" or "ring" or "topological space" all use the word "set", and the definition of "model of ZFC" uses the word "set" in exactly the same way.
The only thing that makes the definition of "model of ZFC" more confusing than these other definitions is that if $(M,\in)$ is a model of ZFC, we think of the elements of $M$ as "sets" and the relation ${\in}\subseteq M^2$ as the "membership relation" between them. It's these elements of $M$ that Qiaochu refers to as "internal sets". The "external sets" are just the ordinary sets of mathematics.
Yes, when we take ZFC to be a foundation of mathematics, we mean that we are going to use the axioms of ZFC to reason about sets ("external sets"). ZFC does not tell us what sets are, it just gives us some rules for reasoning about them. I don't know why you call these rules "intuitive" - they are quite formal.
It means exactly the same thing as "countable" always means: There is a bijection between the underlying set of $M$ and the set of natural numbers $\mathbb{N}$. (Now $M$, being a model of ZFC, also has an internal set called "$\mathbb{N}$" and internal notions of functions, bijections, and countability for internal sets. But all of that is irrelevant, since the underlying set of $M$ is just an ordinary set, not an internal set. If there's a possibility of confusion, we would write the internal version of $\mathbb{N}$ as $\mathbb{N}^M$.)