Russell's paradox
In Zermelo set theory, the proof that answers the titular question is straightforward:
- Assume there is such a set. Call it $R$.
- Fact: $x \notin x$ if and only if $x \in R$. This is the defining property of $R$.
- Assume $R \in R$.
- By the fact, this means $R \notin R$.
- Contradiction!
- Therefore $R \notin R$.
- By the fact, this means $R \in R$.
- Contradiction!
- Therefore no such set exists.
There is an immediate corollary: there is no set of all sets.
- Assume there is a set of all sets. Call it $S$.
- There is a subset $R \subseteq S$ containing exactly those sets $x$ for which $x \notin x$.
- Since every set is in $S$, this $R$ is exactly the set shown above not to exist.
- Contradiction!
- Therefore, there is no set of all sets.
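Spelled out in symbols, the corollary is a one-line computation. This is only a sketch of the argument above; it uses nothing beyond the ability to carve the subset $R$ out of $S$:

$$ R = \{ x \in S \mid x \notin x \} \quad \Longrightarrow \quad \bigl( R \in R \iff R \in S \wedge R \notin R \bigr). $$

If $S$ contained every set, then $R \in S$ would hold automatically and the right-hand side would collapse to $R \in R \iff R \notin R$. Read the other way around, the same computation shows that $R \notin S$ for every set $S$, so every set misses at least one set.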
Rationale for Zermelo set theory
One of the most important features of a set theory is having tools to actually construct sets. Cantor's 'naive' set theory had the most powerful rule of all: if you could name any property $P$, then there was a set of all sets that have property $P$. This let you construct any set you could imagine! Unfortunately, it also lets you construct the set of Russell's paradox, and thus Cantor's set theory is self-contradictory.
Zermelo took a more modest approach*: he looked for a more conservative collection of constructions that sufficed for mathematics, but wasn't so strong as to create any of the known paradoxical sets. Fraenkel added another useful construction, and gave us the axiom of foundation, which simplifies technical arguments.
Among the constructions of Zermelo set theory is the restricted form of Cantor's "comprehension principle": if we have any property $P$ and a set $S$, then we can form the subset of $S$ of things satisfying property $P$.
The axiom of restricted comprehension is exactly the property of a universe of sets that is needed to make the argument in the opening section.
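To make the contrast concrete, here are the two principles side by side, written as first-order schemas with one instance for each property $P$. Take this as a sketch: $T$ is just my name for the set being constructed, and any parameters in $P$ are suppressed.

$$ \text{Cantor (unrestricted):} \qquad \exists T \, \forall x \, \bigl( x \in T \iff P(x) \bigr) $$

$$ \text{Zermelo (restricted):} \qquad \forall S \, \exists T \, \forall x \, \bigl( x \in T \iff x \in S \wedge P(x) \bigr) $$

Taking $P(x)$ to be $x \notin x$ in the first schema produces the set of Russell's paradox directly. In the second, it only produces a subset of a set $S$ that you already have.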
*: I do not know if this is historically accurate. Really, I'm espousing an a posteriori observation about it.
Classes
Set-builder notation is very useful notation to denote sets. Recall that each of the following notations define sets in ZFC:
$$ \{ x \in S \mid P(x) \} \qquad \qquad \{ f(x) \mid x \in S \} \qquad \qquad \{ a, b \} $$
where $a,b,S$ are all sets, $P$ is a unary predicate whose domain includes $S$, and $f$ is a function whose domain includes $S$.
The same notation turns out to be quite useful to define predicates. For example, the predicate
P(x) = "x contains the empty set"
is easily notated as
$$ P = \{ x \mid \emptyset \in x \} $$
and the assertion that $x$ satisfies the predicate $P$ can be written as
$$ x \in P. $$
This notation, formally, has nothing to do with sets: it is alternative notation for logic. When we do this, we call a predicate a "class".
The way you manipulate logic in the form of classes is so strikingly similar to the way you manipulate sets that this unified notation is extremely useful.
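Here are a few entries of the dictionary, as a sketch of how the translation goes. The classes $A$ and $B$ are my own illustrative names, and I write $A(x)$ and $B(x)$ for the predicates that define them.

$$ \begin{aligned} x \in A \cap B &\quad \text{means} \quad A(x) \wedge B(x) \\ A \subseteq B &\quad \text{means} \quad \forall x \, \bigl( A(x) \implies B(x) \bigr) \\ A = B &\quad \text{means} \quad \forall x \, \bigl( A(x) \iff B(x) \bigr) \end{aligned} $$

Each right-hand side is an ordinary statement about sets; no class appears in it as an object.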
To answer a question you had, the only objects are still sets. The only thing that can be a member of a set is a set. The only thing that can be a member of a class is a set. Classes can't be members of anything, because they aren't objects: they're logic. (At least, if we stick to first-order logic...)
It can be technically awkward when you have to pay attention to what is a set and what is a class, especially if you want to reason in a 'stripped down' version of formal logic.
So, von Neumann, Bernays, and Gödel invented (NBG) set theory*. The objects of NBG set theory are classes. It might be a little confusing to use the same word as we did for the alternative view of logic above; however, in practice it's not a problem.
NBG set theory includes a class called $\mathbf{Set}$. $V$ is another commonly used name for this class. There is a theorem/axiom that says if $x \in y$, then $x \in \mathbf{Set}$.
NBG can also be presented (and usually is, I think) as a theory with two sorts: a sort of sets and a sort of classes. Only sets may be elements of things. But for any set there is a class that has the same elements, and it is reasonable to conflate the two.
*: Again, this is not meant to be a historically accurate presentation.
Universes
Another approach to dealing with classes is a Grothendieck universe. However, using them requires assuming a large cardinal axiom.
A Grothendieck universe is, briefly, a set $U$ with the property that the elements of $U$ collectively have good enough properties to be justifiably called a 'universe of sets'. We call the elements of $U$ "small sets". The things we would normally call classes are all subsets of $U$.
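For concreteness, one common way to spell out "good enough properties" is the following list of closure conditions, where $\mathcal{P}(x)$ denotes the power set of $x$. Treat it as a sketch: presentations differ in details, such as whether $U$ is also required to contain an infinite set.

$$ \begin{aligned} & x \in U, \ y \in x \implies y \in U \\ & x, y \in U \implies \{ x, y \} \in U \\ & x \in U \implies \mathcal{P}(x) \in U \\ & I \in U, \ x_i \in U \text{ for all } i \in I \implies \bigcup_{i \in I} x_i \in U \end{aligned} $$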
In this way (other than having had to assume a large cardinal axiom) we don't have to do much that is special -- everything we are talking about is a set. We just occasionally have to take note of which sets are "small" and which are not.
Best Answer
Your intuition about size limitations is wrong. Think about finite sets: there are sets which are finite but as large as you would want them, $6$ elements, $25$ elements, $216$ elements, whatever. But does that mean that the set of natural numbers is a finite set?
The idea behind the transfinite is what happens after you've gone to infinity and beyond. So there are sets and they grow larger and larger, then they become infinite, and they continue to grow larger and larger... eventually you have gone "all the way". Then comes a question: is the collection of everything you have accumulated so far a set? If so, we can keep on going. Classes tell us that eventually (which is a pretty far eventually) we have to stop somewhere.
In the naive approach to mathematics we think that every collection we can talk about is a set, simply because in the naive approach there is no definition of a set.
However, once axiomatic set theory came into play, we got the seemingly circular definition: sets are elements of a model of set theory.
For example, one of the axioms about sets is that they have power sets. One of the theorems linking a set and its power set is that there is no surjective function from a set to its power set.
Suppose the collection of all sets, $V$, was a set itself. What would its power set be? Well, every subcollection of $V$ is a set and therefore in $V$. This means that $P(V)\subseteq V$. However, this means that there is a surjective function from $V$ onto its power set, contradicting the theorem above!
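The theorem about power sets has a short proof, which I sketch here for completeness (the names $X$, $f$, $D$ and $d$ are mine). Given any function $f$ from a set $X$ to its power set $P(X)$, consider the "diagonal" set

$$ D = \{ x \in X \mid x \notin f(x) \}. $$

If $D = f(d)$ for some $d \in X$, then

$$ d \in D \iff d \notin f(d) \iff d \notin D, $$

which is impossible. So $D$ is not in the image of $f$, and $f$ is not surjective. This is the same self-reference trick that drives Russell's paradox.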
Cantor's paradox (as above), as well as Russell's paradox (the collection of all sets which are not elements of themselves is not a set), and several other paradoxes tell us one thing: not all collections we can define are sets.
In ZFC, classes are simply definable collections of sets. What does "definable" mean? It means "all sets which have a property which we can describe in the given language".
One simple way to describe the difference between sets and classes in ZFC is that sets are elements of other sets. Classes are not elements of any other class, so if $A\in B$ then $A$ is a set.
To your edit:
The first thing to want from a foundational mathematical theory (one which you hope to later build most of your mathematics on) is that if you have a certain property, then you can talk about all the things in your universe with this property. The various paradoxes tell us that in ZFC (and in its spawns) some of these collections are not sets. The notion of "proper class" tells us that we can still talk about this collection, but it is not a set per se.
For example, we can talk about ordinals (which are a transfinite generalization of the natural numbers in some sense); the collection of all ordinals is a proper class. We can still talk about "all the ordinals" or prove that some property holds for all of them, even though this collection is not a set.