This requires only the inverse of the Cayley transform. Start with
$$
(U-I)=(A-iI)(A+iI)^{-1}-(A+iI)(A+iI)^{-1}=\big((A-iI)-(A+iI)\big)(A+iI)^{-1}=-2i(A+iI)^{-1}.
$$
Since $(A+iI)^{-1}$ is injective with range $\mathcal{D}(A)$, it follows that $\mathcal{N}(U-I)=\{0\}$ and $\mathcal{R}(U-I)=\mathcal{D}(A)$. Similarly,
$$
(U+I) = 2A(A+iI)^{-1} = iA(U-I).
$$
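For later use, it is worth recording the inverse Cayley transform explicitly; this is a short consequence of the two identities above, not an extra assumption:
$$
A(U-I)=-i(U+I),\qquad\text{hence}\qquad A=i(I+U)(I-U)^{-1}\ \text{ on }\ \mathcal{D}(A)=\mathcal{R}(U-I).
$$
The first equality is used repeatedly below.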
Let $U=\int_{\mathbb{T}}\lambda\,dF(\lambda)$ be the spectral resolution of $U$ over the unit circle $\mathbb{T}$, and, for each $0 < \delta < \pi$, define $G_{\delta}$ to be the characteristic function of the arc $\{ e^{i\theta} : \theta \in [\delta,2\pi-\delta]\}$. Then
$$
P_{\delta} = \int_{\mathbb{T}} G_{\delta}(\lambda)\,dF(\lambda)
$$
is an orthogonal projection with $P_{\delta}x \in \mathcal{D}(A)$ for every $x$, because
$$
Q_{\delta}=\int_{\mathbb{T}} G_{\delta}(\lambda)\frac{1}{\lambda-1}\,dF(\lambda)
$$
is bounded, and $(U-I)Q_{\delta}=P_{\delta}$ then implies that the range of $P_{\delta}$ lies in $\mathcal{R}(U-I)=\mathcal{D}(A)$. Furthermore,
$$
\begin{align}
iAP_{\delta} & = iA(U-I)Q_{\delta}=(U+I)Q_{\delta}=\int_{\mathbb{T}}G_{\delta}(\lambda)\frac{\lambda+1}{\lambda-1}\,dF(\lambda), \\
AP_{\delta} & = \int_{\mathbb{T}}i\,\frac{1+\lambda}{1-\lambda}\,G_{\delta}(\lambda)\,dF(\lambda).
\end{align}
$$
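To spell out the two facts used above (a routine check via the multiplicative functional calculus): on the arc supporting $G_{\delta}$, $|\lambda-1|=|e^{i\theta}-1|=2\sin(\theta/2)\ge 2\sin(\delta/2)$ for $\theta\in[\delta,2\pi-\delta]$, so
$$
\|Q_{\delta}\| \le \frac{1}{2\sin(\delta/2)} < \infty,
\qquad
(U-I)Q_{\delta}=\int_{\mathbb{T}}(\lambda-1)\,G_{\delta}(\lambda)\frac{1}{\lambda-1}\,dF(\lambda)=\int_{\mathbb{T}}G_{\delta}(\lambda)\,dF(\lambda)=P_{\delta}.
$$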
Because $x \in \mathcal{D}(A)$ if and only if $x = (U-I)y$ for some $y$, one has, for all $x \in \mathcal{D}(A)$,
$$
\begin{align}
P_{\delta}Ax & = P_{\delta}A(U-I)y \\
& =-iP_{\delta}(U+I)y\\
& =-i(U+I)P_{\delta}y \\
& = A(U-I)P_{\delta}y \\
& = AP_{\delta}(U-I)y = AP_{\delta}x.
\end{align}
$$
Note that $\mathcal{N}(U-I)=\{0\}$ means $1$ is not an eigenvalue of $U$, so $F(\{1\})=0$ and hence $P_{\delta}\to I$ strongly as $\delta\downarrow 0$. If $x \in \mathcal{D}(A)$, then
$$
Ax = \lim_{\delta\downarrow 0}P_{\delta}Ax=\lim_{\delta\downarrow 0}AP_{\delta}x
= \lim_{\delta\downarrow 0}\int i\frac{1+\lambda}{1-\lambda}G_{\delta}(\lambda)dF(\lambda)x.
$$
Therefore, by the monotone convergence theorem, if $x\in\mathcal{D}(A)$, then
$$
\begin{align}
\|Ax\|^{2} & =\lim_{\delta\downarrow 0}\|AP_{\delta}x\|^{2} \\
& = \lim_{\delta\downarrow 0}\int \left|\frac{1+\lambda}{1-\lambda}\right|^{2}|G_{\delta}(\lambda)|^{2}d\|F(\lambda)x\|^{2} \\
& = \int \left|\frac{1+\lambda}{1-\lambda}\right|^{2}d\|F(\lambda)x\|^{2} < \infty.
\end{align}
$$
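For orientation, the weight in this integral can be computed explicitly (a direct calculation, writing $\lambda=e^{i\theta}$):
$$
\left|\frac{1+\lambda}{1-\lambda}\right|^{2}
=\frac{|1+e^{i\theta}|^{2}}{|1-e^{i\theta}|^{2}}
=\frac{4\cos^{2}(\theta/2)}{4\sin^{2}(\theta/2)}
=\cot^{2}(\theta/2),
$$
which blows up exactly at $\lambda=1$, the point excised by the arcs defining $G_{\delta}$.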
Conversely, if the last integral above is finite for some $x$, then the following limit exists in $X$:
$$
y = \lim_{\delta\downarrow 0}\int i\frac{1+\lambda}{1-\lambda}G_{\delta}(\lambda)dF(\lambda)x = \lim_{\delta\downarrow 0}AP_{\delta}x.
$$
Then, because $P_{\delta}x\to x$ and because $A$ is closed, it follows that $x\in\mathcal{D}(A)$ and $Ax=y$. Finally, one concludes that
$$
x \in \mathcal{D}(A) \iff \int \left|\frac{1+\lambda}{1-\lambda}\right|^{2}d\|F(\lambda)x\|^{2} < \infty.
$$
And, in that case,
$$
Ax = \lim_{\delta\downarrow 0}\int i \frac{1+\lambda}{1-\lambda}G_{\delta}(\lambda)dF(\lambda)x.
$$
Change of variables: the final step is to transplant the spectral measure from $\mathbb{T}$ to $\mathbb{R}$. Define a new spectral measure $E$ on $\mathbb{R}$ by $E(S)=F(\{ \frac{t-i}{t+i} : t\in S\})$. Solving $\frac{t-i}{t+i}=\lambda$ for $t$ gives $t=i\frac{1+\lambda}{1-\lambda}$. So,
$$
x \in \mathcal{D}(A) \iff \int_{-\infty}^{\infty}t^{2}d\|E(t)x\|^{2} < \infty,
$$
and, for any such $x$, the following exists as an improper integral:
$$
Ax = \int_{-\infty}^{\infty}tdE(t)x,\;\; x \in \mathcal{D}(A).
$$
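As a consistency check on the substitution (again writing $\lambda=e^{i\theta}$, $0<\theta<2\pi$):
$$
t=i\frac{1+e^{i\theta}}{1-e^{i\theta}}
=i\,\frac{e^{-i\theta/2}+e^{i\theta/2}}{e^{-i\theta/2}-e^{i\theta/2}}
=i\cdot\frac{2\cos(\theta/2)}{-2i\sin(\theta/2)}
=-\cot(\theta/2)\in\mathbb{R},
$$
which sweeps out all of $\mathbb{R}$ as $\theta$ runs over $(0,2\pi)$, and $t^{2}=\cot^{2}(\theta/2)$ matches the weight $\left|\frac{1+\lambda}{1-\lambda}\right|^{2}$ computed earlier.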
The proof of the spectral theorem for normal operators doesn't rely on the spectral theorem for self-adjoint operators; rather, the two proofs are essentially identical.
How do you construct the spectral measure in the self-adjoint case? One way to do it is to look at the $C^*$-algebra generated by the self-adjoint operator $T$ on the Hilbert space $X$; call it $C^*(T)$. Since $C^*(T)$ is commutative, by Gelfand theory it is isomorphic to the algebra of continuous functions on the spectrum of $T$, $C(\sigma(T))$. Given $x,y\in X$, the map $C^*(T)\to\mathbb C$ given by $S\mapsto \langle Sx,y\rangle$ is a bounded linear functional, hence defines a Borel measure $\mu_{x,y}$ on $\mathbb R$, supported in $\sigma(T)$. Using these measures, we can extend the isomorphism $C(\sigma(T))\to C^*(T)$ to a homomorphism $B(\mathbb R)\to \mathcal B(X)$ from the algebra of bounded Borel functions on $\mathbb R$ to the bounded operators on $X$. The spectral measure is just the restriction of this homomorphism to characteristic functions of Borel sets.
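To make the construction concrete, here is a standard sanity-check example (not specific to the argument above): take $X=L^{2}[0,1]$ and let $T$ be the operator of multiplication by the variable, $(Tx)(s)=s\,x(s)$. Then $T$ is bounded and self-adjoint with $\sigma(T)=[0,1]$, the homomorphism sends a bounded Borel function $f$ to multiplication by $f$, and
$$
E(S)x=\chi_{S}\,x,\qquad d\mu_{x,y}=x\overline{y}\,ds,
$$
so the spectral measure is just multiplication by characteristic functions of Borel sets $S$.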
If now $T$ is normal, $C^*(T)$ is still commutative, and (again by Gelfand theory) is isomorphic to $C(\sigma(T))$, where now $\sigma(T)\subset\mathbb C$. Given $x,y\in X$, the measure $\mu_{x,y}$ is now a Borel measure on $\mathbb C$ supported in $\sigma(T)$, and in this way we obtain a homomorphism $B(\mathbb C)\to\mathcal B(X)$ from the algebra of bounded Borel functions on $\mathbb C$ to $\mathcal B(X)$, and obtain the spectral measure.
The rest of the proof of the spectral theorem should be the same.
EDIT
Hopefully this will help translate my response into language you are familiar with.
Firstly, yes, $C^*(T)$ is as you have defined it.
Secondly, basically the only difference between the two cases is that if $T$ is normal, we define the map $\Phi_0$ from polynomials in two variables $p=p(z,\overline z)$ to $\mathcal B(X)$ by $\sum_{ij}a_{ij}z^i\overline z^j\mapsto \sum_{ij}a_{ij}T^i(T^*)^j$ and extend this by Stone-Weierstrass to a map $\Phi:C(\sigma(T))\to \mathcal B(X)$. We need to consider bivariate polynomials in the normal case because if a compact set $K\subset\mathbb C$ (such as $\sigma(T)$) is not contained in $\mathbb R$, polynomials in one variable are not closed under conjugation, hence the Stone-Weierstrass theorem cannot be applied to them.
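A concrete instance of that failure (a standard example, not tied to any particular $T$): on the unit circle $\{|z|=1\}$ one has $\overline z=z^{-1}$, and no sequence of polynomials $p_n(z)$ can converge to $\overline z$ uniformly there, since
$$
\frac{1}{2\pi i}\oint_{|z|=1}p_n(z)\,dz=0\ \text{ for every }n,
\qquad\text{while}\qquad
\frac{1}{2\pi i}\oint_{|z|=1}\overline z\,dz=\frac{1}{2\pi i}\oint_{|z|=1}\frac{dz}{z}=1,
$$
and uniform convergence would force the left-hand integrals to converge to the right-hand one.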
Thirdly, there are plenty of books out there that prove the spectral theorem for normal operators, leaving the case of self-adjoint operators as a corollary, but most of the ones I'm familiar with develop some basic $C^*$-algebra theory to make the proofs more transparent. See for instance Conway's or Rudin's functional analysis books, or Murphy's $C^*$-algebras and Operator Theory.
Best Answer
The main reason for posting this was to answer it, thus collecting all this stuff in a single place for future (and present) reference.
The first item in this proof is that a linear operator on a finite-dimensional complex vector space admits an upper triangular representation. This is proved by induction on $n:=\dim V$, $V$ being the vector space. If $\dim V=1$, the claim is trivial. Suppose $\dim V=n>1$ and the theorem holds for all dimensions up to $n-1$.

First, our operator $T$ has an eigenvalue. Indeed, pick $v\neq0$ and consider $v,Tv,T^2v,T^3v,\dotsc,T^nv$. These cannot be linearly independent, since they are $n+1$ vectors and $\dim V=n$. So there exist $a_0,\dotsc,a_n\in\mathbb{C}$, not all zero, such that: $$\sum_{i=0}^na_iT^iv=0.$$ Let $m$ be the largest index such that $a_m\neq0$. Then $m\geq1$: if only $a_0$ were nonzero, we would get $a_0v=0$ with $v\neq0$, forcing $a_0=0$. Factor the polynomial: $$a_0+a_1z+\dotso+a_mz^m=c(z-\lambda_1)\cdot\dotso\cdot(z-\lambda_m).$$ Substituting $T$ for $z$, and applying to $v$, we find: $$0=\left(\sum_{i=0}^ma_iT^i\right)v=c(T-\lambda_1I)\cdot\dotso\cdot(T-\lambda_mI)v,$$ so $T-\lambda_iI$ is not injective for some $i$. But this equates to $\lambda_i$ being an eigenvalue, since not injective iff nontrivial kernel iff $(T-\lambda_iI)w=0$ for some $w\neq0$ iff $Tw=\lambda_iw$, i.e. $\lambda_i$ is an eigenvalue.

So, going back to our original $T$, consider any eigenvalue $\lambda$. $T-\lambda I$ is not injective, so by rank-nullity $T-\lambda I$ is not surjective. If $U=\mathrm{Im}(T-\lambda I)$ is the range of that operator, then $\dim U<\dim V$. Also, $U$ is invariant under $T$, since: $$Tu=(T-\lambda I)u+\lambda u,$$ and if $u\in U$ then both summands are in $U$. So $T|_U$ is an operator on $U$, and by the induction hypothesis there exists a basis of $U$ such that $T|_U$ is represented by an upper triangular matrix w.r.t. that basis. So if $k:=\dim U$ and that basis is $\{u_1,\dotsc,u_k\}$, then $Tu_j$ is in the span of $u_1,\dotsc,u_j$ for all $j\leq k$. Extend that basis to a basis of $V$ by adding extra vectors $v_1,\dotsc,v_{n-k}$. For each $i\leq n-k$ we have $Tv_i=(T-\lambda I)v_i+\lambda v_i$ with $(T-\lambda I)v_i\in U$, so $Tv_i$ is in the span of $u_1,\dotsc,u_k,v_1,\dotsc,v_i$. And this gives us upper triangularity of the matrix representing $T$ w.r.t. $u_1,\dotsc,u_k,v_1,\dotsc,v_{n-k}$, QED.
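A minimal illustration of the induction step (a made-up $2\times2$ example): take $T$ on $\mathbb{C}^{2}$ with $Te_{1}=e_{2}$, $Te_{2}=0$. Its only eigenvalue is $\lambda=0$, and $U=\mathrm{Im}(T-0\cdot I)=\mathrm{span}(e_{2})$. The basis $\{u_{1}\}=\{e_{2}\}$ of $U$ trivially triangularizes $T|_{U}$ (indeed $Tu_{1}=0$), and extending by $v_{1}=e_{1}$ gives $Tv_{1}=e_{2}=u_{1}$, so w.r.t. $(u_{1},v_{1})$
$$
M(T)=\begin{pmatrix}0&1\\0&0\end{pmatrix},
$$
which is upper triangular, even though w.r.t. $(e_{1},e_{2})$ the matrix $\begin{pmatrix}0&0\\1&0\end{pmatrix}$ is not.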
The rest of this answer is practically copied off this pdf. First of all, notice how $T$, a linear operator, is uniquely determined by the values of $\langle Tu,v\rangle$ for $u,v\in V$. That is because the inner product is positive definite: if $S$ satisfies $\langle Tu,v\rangle=\langle Su,v\rangle$ for all $u,v\in V$, we first conclude $\langle(T-S)u,v\rangle=0$ for all $u,v\in V$; fixing $u$ and taking $v=(T-S)u$ gives $\|(T-S)u\|^2=0$, so $(T-S)u=0$, and since that holds for all $u$, $T-S=0$, i.e. $T=S$. This makes it sensible to define an operator via: $$\langle Tu,v\rangle=\langle u,T^\ast v\rangle,$$ for all $u,v\in V$. $T^\ast$ is uniquely determined as seen above, and is called the adjoint of $T$ w.r.t. this inner product. Elementary properties of the operation of taking the adjoint are that $(S+T)^\ast=S^\ast+T^\ast$, $(aS)^\ast=\bar aS^\ast$ in the complex case, the identity is self-adjoint (i.e. coincides with its adjoint), adjoining is an involution (i.e. $(T^\ast)^\ast=T$), $M(T^\ast)=M(T)^\ast$ in the complex case w.r.t. any orthonormal basis, denoting by $^\ast$ the conjugate transpose of a matrix, and $(ST)^\ast=T^\ast S^\ast$. The linked pdf also proves that the eigenvalues of a self-adjoint operator are all real, but this is irrelevant here, so I will leave the proof to that pdf.

We define normal operators as those for which $TT^\ast=T^\ast T$, i.e. those commuting with their adjoints. The polarization identity is another interesting result I leave to the pdf. One result we will use is that, with $\|v\|=\sqrt{\langle v,v\rangle}$, one has $\|Tv\|=\|T^\ast v\|$ for every $v$ iff $T$ is normal. The proof is immediate: \begin{align*} T\text{ is normal}\iff{}&TT^\ast-T^\ast T=0\iff\langle(TT^\ast-T^\ast T)v,v\rangle=0\quad\forall v\in V\iff{} \\ {}\iff{}&\langle T^\ast Tv,v\rangle=\langle TT^\ast v,v\rangle\quad\forall v\in V\iff{} \\ {}\iff{}&\|T^\ast v\|^2=\langle T^\ast v,T^\ast v\rangle=\langle Tv,Tv\rangle=\|Tv\|^2\quad\forall v\in V. \end{align*}
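A quick concrete check of the norm identity on a normal, non-self-adjoint operator (a made-up example): the rotation $T$ on $\mathbb{C}^{2}$ with
$$
M(T)=\begin{pmatrix}0&-1\\1&0\end{pmatrix},\qquad
M(T^{\ast})=\begin{pmatrix}0&1\\-1&0\end{pmatrix}=M(T)^{-1},
$$
satisfies $TT^{\ast}=T^{\ast}T=I$, and indeed $\|Tv\|=\|v\|=\|T^{\ast}v\|$ for every $v$, both operators being isometries.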
As is subsequently proved, the norm identity implies that if $T$ is normal then the kernels of $T$ and $T^\ast$ coincide, the eigenvalues of $T^\ast$ are the conjugates of those of $T$, and eigenvectors associated to distinct eigenvalues are orthogonal (for arbitrary operators one only gets linear independence; orthogonality uses normality).
Now the big result: unitary diagonalizability equates to normality. This statement is equivalent to saying an operator $T$ is normal iff it admits an orthonormal eigenbasis, since a change of basis between orthonormal bases is unitary. So let us assume $T$ is normal. By the triangularization result above (after applying Gram-Schmidt, which preserves upper triangularity), $T$ is represented by an upper triangular matrix w.r.t. some orthonormal basis $e_1,\dotsc,e_n$. We take that basis and show the corresponding matrix representation $M(T)=(a_{ij})_{i,j=1}^n$ is in fact diagonal. This makes use of the Pythagorean theorem, proved here, and of the norm identity we proved a while ago relating the norm of an image via $T$ to that via $T^\ast$. Proceed row by row. Upper triangularity gives $Te_1=a_{11}e_1$, and since $M(T^\ast)=M(T)^\ast$ we also know $T^\ast e_1=\sum_{k=1}^n\bar a_{1k}e_k$. So by the Pythagorean theorem and the norm identity: $$|a_{11}|^2=\|Te_1\|^2=\|T^\ast e_1\|^2=\sum_{k=1}^n|a_{1k}|^2,$$ forcing $a_{1k}=0$ for $k>1$. Inductively, once rows $1,\dotsc,i-1$ are known to vanish off the diagonal, upper triangularity gives $Te_i=a_{ii}e_i$ and $T^\ast e_i=\sum_{k=i}^n\bar a_{ik}e_k$, so the same computation yields $|a_{ii}|^2=\sum_{k=i}^n|a_{ik}|^2$ and $a_{ik}=0$ for $k>i$. The above holds for every $i$, proving $M(T)$ is diagonal.
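The $2\times2$ case shows the mechanism in miniature (a direct computation, not needed for the general proof): if
$$
M(T)=\begin{pmatrix}a&b\\0&c\end{pmatrix},
$$
then comparing the $(1,1)$ entries of $M(T)M(T)^{\ast}$ and $M(T)^{\ast}M(T)$ gives $|a|^{2}+|b|^{2}=|a|^{2}$, so normality forces $b=0$ and the matrix is diagonal.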
Now suppose $T$ is diagonalized by some orthonormal eigenbasis, i.e. $M(T)$ is diagonal w.r.t. that basis. Since $M(T^\ast)=M(T)^\ast$, the matrix of $T^\ast$ is also diagonal w.r.t. the same basis: the eigenvalues are mutually conjugate and the eigenvectors coincide. But we know $M(TT^\ast)=M(T)M(T^\ast)$, so: $$M(TT^\ast)=M(T)M(T^\ast)=M(T)M(T)^\ast=M(T)^\ast M(T)=M(T^\ast)M(T)=M(T^\ast T),$$ since diagonal matrices always commute. Thus $TT^\ast=T^\ast T$: if the matrix representations w.r.t. some basis coincide, the two operators have the same image on every vector and so coincide. So if $T$ is unitarily diagonalizable, $T$ is normal.
Update
I just realised the proof implicitly uses the fact that if the quadratic form associated to an operator is zero then the operator is zero, i.e. $\langle Tv,v\rangle=0\,\,\forall v\in V\implies T=0$. This is proved here on p. 147.
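For reference, the identity behind that fact (complex case, inner product linear in the first argument, as used throughout) is the polarization-type expansion
$$
\langle Tu,w\rangle=\frac{1}{4}\sum_{k=0}^{3}i^{k}\,\langle T(u+i^{k}w),\,u+i^{k}w\rangle,
$$
so if $\langle Tv,v\rangle=0$ for all $v$, every term on the right vanishes, forcing $\langle Tu,w\rangle=0$ for all $u,w$ and hence $T=0$.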