The first question is, in what sense do you want to understand convergence of the sum?
The correct notion here is unconditional convergence, i.e. we show that there is some $h \in H$ such that for every $\epsilon > 0$, there is some finite subset $J_\epsilon \subset I$ with $\Vert h - \sum_{j\in J} a_j x_j \Vert < \epsilon$ for all finite sets $J_\epsilon \subset J \subset I$.
To this end, first note that $a_i = 0$ for all but countably many $i$ (since $(a_i)_i \in \ell^2$). Hence, we can choose pairwise distinct $(i_n)_n$ with $\{i_n \mid n\} = \{i \mid a_i \neq 0\}$ (if this set is finite, the claim is trivial).
Consider the sequence $h_N := \sum_{n=1}^N a_{i_n} x_{i_n}$. We then have (for $N \geq M \geq N_0$)
$$
\Vert h_N - h_M\Vert^2 = \Vert \sum_{n=M+1}^N a_{i_n} x_{i_n}\Vert ^2 = \sum_{n=M+1}^N |a_{i_n}|^2 \leq \sum_{n=N_0 + 1}^\infty |a_{i_n}|^2 \xrightarrow[N_0 \to \infty]{} 0
$$
where we used orthogonality of the $x_i$ and the Pythagorean theorem in the second equality.
Hence, the sequence $(h_N)_N$ is Cauchy and thus convergent to some $h \in H$ by completeness of $H$.
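For intuition, here is a small numerical sketch (my own finite-dimensional illustration, not part of the proof) of the Pythagoras step: for orthonormal vectors $x_n$ and square-summable coefficients $a_n$, the squared norm of a partial-sum difference equals the corresponding tail of $\sum_n |a_n|^2$.

```python
import numpy as np

# Build an orthonormal family x_1, ..., x_50 in R^50 via a QR factorization.
rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.standard_normal((50, 50)))
x = Q.T                        # rows of x are orthonormal vectors
a = 1.0 / np.arange(1, 51)     # square-summable coefficients a_n = 1/n

def partial_sum(N):
    """h_N = sum_{n=1}^{N} a_n x_n (0-based indexing internally)."""
    return sum(a[n] * x[n] for n in range(N))

# Pythagoras: ||h_N - h_M||^2 equals the coefficient tail sum_{n=M+1}^{N} |a_n|^2.
M, N = 10, 30
lhs = np.linalg.norm(partial_sum(N) - partial_sum(M)) ** 2
rhs = np.sum(a[M:N] ** 2)
print(np.isclose(lhs, rhs))    # True
```

Since the tails of a convergent series go to $0$, this is exactly why $(h_N)_N$ is Cauchy.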
It remains to prove that we indeed have unconditional convergence to this "candidate limit". To see this, let $\epsilon >0$ be arbitrary. Then there is a finite $J_\epsilon \subset I$ with $\sum_{i \in I\setminus J_\epsilon}|a_i|^2<\epsilon^2$. Now let $J_\epsilon \subset J \subset I$ be finite. We get
$$
\Vert h-\sum_{j \in J} a_j x_j\Vert ^2=\lim_N \Vert \sum_{n=1}^N a_{i_n}x_{i_n}-\sum_{j\in J} a_j x_j\Vert^2=\lim_N \sum_{j \in J \,\Delta\, \{i_n \mid n=1,\dots,N\}} |a_j|^2,
$$
where we again used the Pythagorean theorem. Here, $\Delta$ denotes the symmetric difference.
Because of $\{i_n \mid n\in \Bbb{N}\}=\{i\mid a_i \neq 0\}$, the above limit equals
$$
\sum_{i \in I \setminus J}|a_i|^2 \leq \sum_{i \in I\setminus J_\epsilon}|a_i|^2 < \epsilon^2,
$$
so $\Vert h-\sum_{j \in J} a_j x_j\Vert < \epsilon$, which completes the proof.
Question 1.
For any measurable functions $f,g:X\to[0,\infty]$, it is a standard exercise (which you should definitely attempt yourself; there are certainly also questions about it on this site) that $[f\leq g]:=\{x\in X\,|\,f(x)\leq g(x)\}$ is measurable (from this it follows that $[f\geq g], [f<g], [f>g], [f=g]$ are all measurable). Do you see how this applies to your case?
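As a hint for that exercise, here is the standard sketch (my own spelling-out, using the density of $\Bbb{Q}$):
```latex
[f < g] \;=\; \bigcup_{q \in \mathbb{Q}} \bigl([f < q] \cap [q < g]\bigr)
       \;=\; \bigcup_{q \in \mathbb{Q}} \Bigl(f^{-1}\bigl([0, q)\bigr) \cap g^{-1}\bigl((q, \infty]\bigr)\Bigr),
```
a countable union of measurable sets, hence measurable. Then $[f\leq g]$ is the complement of $[g<f]$, and $[f=g]=[f\leq g]\cap[g\leq f]$.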
Question 2.
Yes, you can take limits in $(5)$. For every $c<1$, we have $\alpha\geq c\int_X s\,d\mu$, so take the limit as $c\to 1^-$ on both sides. You have surely seen in a basic analysis course that non-strict inequalities are preserved under limits.
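Written out, the limit step is just this (a restatement of the argument above):
```latex
\alpha \;\geq\; c\int_X s\,d\mu \quad\text{for all } 0 < c < 1
\quad\implies\quad
\alpha \;\geq\; \lim_{c \to 1^-} c\int_X s\,d\mu \;=\; \int_X s\,d\mu .
```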
Question 3.
From $(6)$ to $(7)$ it's simply applying the definition of the Lebesgue integral as a supremum. We have defined $\alpha$ in the beginning, and what $(6)$ is showing is that for every simple $0\leq s\leq f$, we have $\int_Xs\,d\mu\leq \alpha$. In other words, the set of numbers $\left\{\int_Xs\,d\mu\,:\, \, \text{$0\leq s\leq f$ is simple}\right\}$ has $\alpha$ as an upper bound. Thus by definition of supremum, we have that
\begin{align}
\sup\left\{\int_Xs\,d\mu\,:\, \, \text{$0\leq s\leq f$ is simple}\right\} \leq \alpha.
\end{align}
But the LHS is none other than $\int_Xf\,d\mu$, by definition.
Question 4.
The inequality $\lim\limits_{n\to \infty}\int_Xf_n\,d\mu =\alpha \leq \int_Xf\,d\mu$ is an easy consequence of monotonicity of the integral. The other direction is not so trivial, because the definition of $\int_Xf\,d\mu$ involves a huge supremum over all simple functions. Also, up to this point in the treatment, the only things we really know about integrals are basic facts (Theorem 1.24), and the only integrals we can explicitly calculate are those of simple functions. The idea is therefore to reduce the more complicated problem of proving the reverse inequality to something more familiar.
By definition of the Lebesgue integral using supremum, showing that $\int_Xf\,d\mu\leq \alpha$ is thus equivalent to showing that for every simple $0\leq s\leq f = \lim f_n$, we have $\int_Xs\,d\mu \leq \alpha$. It would be nice if we could say something like "because $s$ is smaller than $f$ and since $f_n$ increases to $f$, thus for large $n$, we have $s\leq f_n$". If we could say this, then we have $\int_Xs\,d\mu \leq \int_Xf_n\,d\mu$ for all large $n$, and hence by taking limits, $\int_Xs\,d\mu \leq \alpha$, thereby completing the proof.
Unfortunately, this isn't quite right: we CANNOT deduce that $s\leq f_n$ for large $n$. So what can we do? We give ourselves some room for error by introducing the scaling factor $0<c<1$. Then $cs$ is an overall scaled-down function, and for it we can obtain nice control: $cs\leq f_n$ on $E_n$, and $\bigcup E_n = X$. By Rudin's argument, we thus deduce $c\int_Xs\,d\mu \leq \alpha$. Then, finally, we let $c\to 1^-$.
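To see concretely why the scaling factor is needed, consider the toy example (my own choice, not Rudin's) $f_n \equiv 1-\tfrac1n$ on $[0,1]$, so $f \equiv 1$ and the simple function $s = f$ satisfies $s \le f$, but $s\le f_n$ holds for NO $n$; after scaling by any $c<1$, however, $cs \le f_n$ does hold for all large $n$:

```python
import numpy as np

xs = np.linspace(0, 1, 1001)               # sample points in [0, 1]
s = np.ones_like(xs)                       # simple function s = f = 1 on [0, 1]

def f_n(n):
    return (1 - 1 / n) * np.ones_like(xs)  # f_n = 1 - 1/n increases to f = 1

# s <= f_n fails for EVERY n, since f_n < 1 = s everywhere:
print(any(np.all(s <= f_n(n)) for n in range(1, 100)))   # False

# But after scaling down by c < 1, c*s <= f_n holds once 1 - 1/n >= c:
c = 0.5
print(np.all(c * s <= f_n(10)))                          # True
```

This is exactly the situation Rudin's sets $E_n = [cs \le f_n]$ are designed to handle.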
This final idea of "giving yourself some room for error" is a VERY VERY VERY common idea in analysis. If you've studied baby Rudin (or Spivak or any other introductory book), you must have surely seen such ideas. The simplest example of this I can think of is that given $z\in \Bbb{C}$, we have $z=0$ if and only if for every $\epsilon>0$, we have $|z|\leq \epsilon$. These are equivalent statements, but sometimes, the second statement is easier to prove, because you have an $\epsilon$ amount of wiggle room to establish the inequality $|z|\leq \epsilon$.