As Thorny said, Milnor's axiomatic definition seems to be precisely the best way of proving that different definitions are the same. The main thrust of his "definition" is the proof that any invariants that satisfy these axioms must be the same as Stiefel-Whitney classes. In the book, Milnor and Stasheff use the axioms to connect the two notions I describe below as well as the Steenrod-squares definition. The axioms should also serve to prove that all the definitions you talk about are the same.
The rest of this answer might have less to do with your exact question than with my tendency to see an interesting question title and start writing. Sorry! Still, I feel that they are things that should be said (or, at least, don't deserve to be deleted).
I think there are two very important ways to understand characteristic classes. Both are explained in Milnor's Characteristic Classes, but not as the definition, since they are not as precise (but, to me, they are much more intuitive).
Think of your vector bundle as a map from your space X into a Grassmannian. The cohomology of the Grassmannian (more precisely, either the $\mathbb Z/2$ cohomology of the real Grassmannian, or the usual cohomology of the complex Grassmannian) is a polynomial algebra on some generators. The characteristic classes (Stiefel-Whitney or Chern, respectively) are precisely the pullbacks of these cohomology classes to X via the map.
Reading your question carefully, I guess you already knew this. Still, I think you should give this definition more credit. In particular, I think that this is the best explanation of the philosophical reason why "characteristic classes" exist. One thing that confuses me: why are the pullbacks of the integer cohomology of the real Grassmannian never called characteristic classes? I'm sure they are a pain to calculate, but that doesn't justify why nobody seems to care for them at all...
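For reference, here is the Grassmannian description written out in symbols. This is only a sketch under the usual assumptions: the bundle $E\to X$ has rank $n$, the Grassmannian is the infinite one, and $f_E$ (my notation) denotes a classifying map of $E$.

$$
\begin{aligned}
H^\ast(Gr_n(\mathbb R^\infty);\mathbb Z/2) &\cong \mathbb Z/2[w_1,\dots,w_n], & |w_i| &= i, & w_i(E) &= f_E^\ast w_i,\\
H^\ast(Gr_n(\mathbb C^\infty);\mathbb Z) &\cong \mathbb Z[c_1,\dots,c_n], & |c_i| &= 2i, & c_i(E) &= f_E^\ast c_i.
\end{aligned}
$$

Since homotopic maps induce the same map on cohomology, the classes depend only on the isomorphism class of the bundle.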
You can understand them through obstruction theory (another reference: Steenrod's "The Topology of Fibre Bundles"). The idea is to generalize the definition of the Euler characteristic using vector fields. Namely, try to construct a nowhere-zero section of your bundle. The obstruction will be a cohomology class, which is called the Euler class (and corresponds mod 2 to the top Stiefel-Whitney class). Try to construct two linearly independent nowhere-zero sections of the bundle. The obstruction will be a cohomology class which, mod 2, will be the next (one dimension lower) Stiefel-Whitney class. If you keep going like this, you'll construct all the classes.
Here is an explanation of why the obstructions to constructing non-zero sections are cohomology classes, for the case of a single section.
Think of your space X as a CW-complex; start constructing the section on the 0-skeleton, and then try extending it to the 1-skeleton, and so on. At each step, you will basically be solving the following problem:
Given a vector field on the boundary $S^{n-1}$ of the ball $B^n$, can you extend it to the whole ball?
To solve this, think of the vector field as a map $S^{n-1}\to \mathbb R^m$ where m is the dimension of your bundle (you can assume that the bundle is trivial over the ball $B^n$ since the ball is contractible). Since the vector field is supposed to be nowhere zero, you can think of this as a map $S^{n-1}\to S^{m-1}$. If $n<m$, this map is always nullhomotopic and always extends to the ball. If $n=m$, you get an integer, the degree of the map, which tells you whether you can extend: the extension exists exactly when the degree is zero. Since you get an integer for each $m$-cell of the CW-complex, you get something that looks like a cohomology class in $H^m(X)$ (of course, you need to verify separately that it actually is one, and if you are precise enough, you'll see that, unless the bundle is oriented, these integers only make sense mod 2). This is the Euler class.
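In symbols, the cochain described above can be sketched as follows. I assume here that the bundle $E$ is oriented, so that the degrees are honest integers; the names $s$, $\sigma$ and $c_s$ are notation introduced only for this sketch.

$$
\begin{aligned}
c_s(\sigma) &= \deg\bigl(\partial\sigma \cong S^{m-1} \xrightarrow{\ s\ } \mathbb R^m\setminus\{0\} \simeq S^{m-1}\bigr) \in \mathbb Z \quad\text{for each $m$-cell } \sigma,\\
[c_s] &= e(E) \in H^m(X;\mathbb Z), \qquad e(E) \bmod 2 = w_m(E).
\end{aligned}
$$

As a sanity check, take the tangent bundle of $S^2$ with one 0-cell and one 2-cell: the section over the 0-skeleton is a single nonzero vector, and the degree on the boundary of the 2-cell is $2=\chi(S^2)$ (up to a sign depending on orientation conventions), so no nowhere-zero vector field exists, while $w_2$ vanishes because $2\equiv 0 \bmod 2$.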
If you wanted to construct two linearly independent sections, first construct one up to the $(m-1)$-skeleton (which is always possible). Now, let's start making the second one. You might as well require the second section to be orthogonal to the first. So, in the extension problem, you'll have a map $S^{n-1}\to \mathbb R^{m-1}$ where the $\mathbb R^{m-1} \subset \mathbb R^m$ is the subspace orthogonal to the first section. Since it also can't be zero, it's really a map $S^{n-1}\to S^{m-2}$. The rest of the argument is the same; you get a class in $H^{m-1}(X)$.
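Continuing the same heuristic, the bookkeeping for the $k$-th section (chosen orthogonal to the first $k-1$) looks like this. It is a sketch of the pattern rather than a proof; the rigorous version replaces the spheres by Stiefel manifolds.

$$
\begin{array}{c|c|c|c}
\text{section} & \text{extension problem} & \text{first obstruction in} & \text{mod 2 class}\\\hline
k=1 & S^{n-1}\to S^{m-1} & H^{m}(X) & w_m\\
k=2 & S^{n-1}\to S^{m-2} & H^{m-1}(X) & w_{m-1}\\
\vdots & \vdots & \vdots & \vdots\\
k & S^{n-1}\to S^{m-k} & H^{m-k+1}(X) & w_{m-k+1}
\end{array}
$$

In each row the map is automatically nullhomotopic while $n-1 < m-k$, so the first possible obstruction appears on the cells of dimension $n=m-k+1$.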
Usual disclaimer: there may be mistakes anywhere. Please point them out!
I think the easiest way to understand the Bockstein spectral sequence is through the exact couple coming from the long exact sequence of cohomology associated to $0\to\mathbb Z\to\mathbb Z\to \mathbb Z/2\to0$. This shows first that indeed the first differential is $Sq^1$ and tells you that the next page is the direct sum of the cokernel and kernel (shifted one step) of multiplication by $2$ on $2H^\ast(X,\mathbb Z)$. Hence it is like what you would get from applying the universal coefficient formula to $2H^\ast(X,\mathbb Z)$ (instead of $H^\ast(X,\mathbb Z)$). When each cohomology group $H^\ast(X,\mathbb Z)$ is finitely generated this means concretely that you "keep" each $\mathbb Z$-factor (as well as odd torsion) and downgrade each $\mathbb Z/2^n$ to $\mathbb Z/2^{n-1}$.
In particular the difference between the dimension of $H^n(X,\mathbb Z/2)$ and that of the $Sq^1$-cohomology is equal to the number of $\mathbb Z/2$-factors in $H^n(X,\mathbb Z)$ and $H^{n+1}(X,\mathbb Z)$.
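The counting rule above is easy to mechanize. Here is a minimal sketch in Python, assuming every $H^n(X,\mathbb Z)$ is finitely generated; the encoding of groups and the function names (`mod2_dim`, `halve`, `sq1_cohomology_dim`) are hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Tuple

# A group is encoded as (free_rank, exponents of its 2-primary cyclic summands),
# e.g. (1, [1, 3]) stands for Z + Z/2 + Z/8. Odd torsion is dropped because it
# contributes nothing to Z/2 cohomology.
Group = Tuple[int, List[int]]
Cohomology = Dict[int, Group]  # degree n -> H^n(X, Z); missing degrees are 0


def mod2_dim(H: Cohomology, n: int) -> int:
    """Dimension of H^n(X, Z/2) from the universal coefficient formula."""
    rank, tors = H.get(n, (0, []))
    _, tors_next = H.get(n + 1, (0, []))
    # H^n tensor Z/2 contributes rank + len(tors); Tor(H^{n+1}, Z/2) the rest.
    return rank + len(tors) + len(tors_next)


def halve(H: Cohomology) -> Cohomology:
    """Replace each group A by 2A: Z stays Z, Z/2^k becomes Z/2^(k-1)."""
    return {n: (rank, [k - 1 for k in tors if k > 1])
            for n, (rank, tors) in H.items()}


def sq1_cohomology_dim(H: Cohomology, n: int) -> int:
    """Dimension of the Sq^1-cohomology (second page) in degree n."""
    return mod2_dim(halve(H), n)


# Example: the real projective plane, with H^0 = Z, H^1 = 0, H^2 = Z/2.
rp2: Cohomology = {0: (1, []), 2: (0, [1])}
print([mod2_dim(rp2, n) for n in range(3)])            # [1, 1, 1]
print([sq1_cohomology_dim(rp2, n) for n in range(3)])  # [1, 0, 0]
```

For the projective plane the drop in degrees 1 and 2 is accounted for by the single $\mathbb Z/2$ in $H^2(X,\mathbb Z)$, which is counted once as part of $H^{n+1}$ (for $n=1$) and once as part of $H^n$ (for $n=2$).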
I found a reference for Q2 in Madsen, Milgram: The classifying spaces for surgery and cobordism of manifolds, Ann. of Math. Studies 92, where they refer to Browder: Torsion in H-spaces, Ann. of Math. 74, for the Bockstein s.s. of $K(\mathbb Z_{(2)},n)$ and $K(\mathbb Z/2,n)$. The Madsen-Milgram book also contains other examples of computations with the Bockstein s.s.
Here's one way to understand them. The external cup square $a \otimes a \in H^{2n}(X \times X)$ of $a \in H^n(X)$ is represented by a map $f:X \times X \to K(Z_2, 2n)$. It can be shown that this map factors through a map $g:(X \times X) \times_{Z_2} EZ_2 \to K(Z_2, 2n)$, where $Z_2$ acts on the product by permuting the factors and $EZ_2$ can be taken to just be $S^\infty$. If you unravel what this means, it says that our original map $f$ was homotopic to the map obtained by first switching the coordinates and then applying $f$. It also says that this homotopy, when applied twice to get a homotopy from $f$ to itself, is homotopic to the identity homotopy, and we similarly have a whole series of higher "coherence" homotopies. Now $X \times BZ_2$ maps to $(X \times X) \times_{Z_2} EZ_2$ as the diagonal, so we get a map $X \times BZ_2 \to K(Z_2, 2n)$. But $BZ_2$'s cohomology is just $Z_2[t]$, so this gives a cohomology class $Sq(a) \in H^*(X)[t]$ of degree $2n$. If we write $Sq(a)=\sum s(i) t^i$, it can be shown that $s(i)=Sq^{n-i}a$.
What does this mean? Well, if our map $f$ actually was invariant under switching the factors (which you might think it ought to be, given that it appears to be defined symmetrically in the two factors), we could take $g$ to be just the projection onto $X \times X$ followed by $f$. This would mean that $Sq(a)$ comes from just projecting away the $BZ_2$ and then using $a^2$, i.e. $Sq^n(a)=a^2$ and $Sq^i(a)=0$ for all other $i$. Thus the nonvanishing of the lower Steenrod squares somehow measures how the cup product, while homotopy-commutative (in terms of the induced maps to Eilenberg-MacLane spaces), cannot be straightened to be actually commutative. Indeed, in the universal example $X=K(Z_2,n)$, the map $f$ is exactly the universal map representing the cup product of two cohomology classes of degree $n$.
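To make the bookkeeping in the formula $Sq(a)=\sum s(i)t^i$ concrete, here it is written out, together with the smallest interesting example. The example (a degree 1 class $x$ on $X=\mathbb{RP}^\infty$) is my choice of illustration, not something taken from the notes.

$$
\begin{aligned}
Sq(a) &= \sum_{i=0}^{n} Sq^{n-i}(a)\,t^i = a^2 + Sq^{n-1}(a)\,t + \dots + Sq^{0}(a)\,t^n, \qquad |a| = n,\ |t| = 1,\\
Sq(x) &= Sq^1(x) + Sq^0(x)\,t = x^2 + x\,t, \qquad x \in H^1(\mathbb{RP}^\infty;Z_2).
\end{aligned}
$$

Each term has total degree $(2n-i)+i=2n$, as it must, and the term $x\,t$ in the example is exactly the kind of contribution that would be absent if $f$ were strictly symmetric.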
Some somewhat terse notes on this can be found here; see particularly part III. (Sorry, the link is now dead.)