Since $\int f\,d\mu < \infty$, we may assume without loss of generality that $f \in L^1$.
Since $\int f_n\,d\mu \to \int f\,d\mu$ and $\int f\,d\mu < \infty$, we have $\int f_n\,d\mu < \infty$ for all sufficiently large $n$.
Hence, without loss of generality, assume that $f$ and all the $f_n$ are in $L^1$.
Now, by Scheffé's lemma, we have $\int |f_n - f| d\mu \to 0$.
Since $\int_E |f_n - f| d\mu \leq \int |f_n - f| d\mu$ for each measurable $E$, we are done.
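Spelled out, the final step is the chain (for any measurable $E$)
$$\left|\int_E f_n\,d\mu - \int_E f\,d\mu\right| \leq \int_E |f_n - f|\,d\mu \leq \int_X |f_n - f|\,d\mu \longrightarrow 0,$$
where the first inequality is just $\left|\int g\,d\mu\right| \leq \int |g|\,d\mu$ applied to $g = f_n - f \in L^1$.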
Question 1.
For any measurable functions $f,g:X\to[0,\infty]$, it is a standard exercise (which you should definitely attempt yourself; there are certainly questions about it on this site as well) that $[f\leq g]:=\{x\in X\,|\,f(x)\leq g(x)\}$ is measurable (from this it follows that $[f\geq g], [f<g], [f>g], [f=g]$ are all measurable). Do you see how this applies to your case?
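For instance, one standard way to do the exercise (a sketch of the usual argument, in case you get stuck) is to reduce everything to the strict-inequality set via the rationals:
$$[f<g]=\bigcup_{q\in\Bbb{Q}}\big([f<q]\cap[q<g]\big)=\bigcup_{q\in\Bbb{Q}}\Big(f^{-1}\big([0,q)\big)\cap g^{-1}\big((q,\infty]\big)\Big),$$
which is a countable union of measurable sets since $f$ and $g$ are measurable. Then $[f\geq g]=X\setminus[f<g]$ is measurable, and similarly $[f\leq g]$, $[f>g]$, and $[f=g]=[f\leq g]\cap[f\geq g]$.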
Question 2.
Yes, you can take limits in $(5)$. For every $c<1$, we have $\alpha\geq c\int_X s\,d\mu$. So, take the limit as $c\to 1^-$ on both sides. You have surely seen in a basic analysis course that non-strict inequalities are preserved under limits.
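Explicitly (a one-line spelling-out of that step): if $\int_X s\,d\mu=\infty$, then already $\alpha\geq c\int_X s\,d\mu=\infty$ for any fixed $c>0$; and if $\int_X s\,d\mu<\infty$, then taking $c=1-\frac{1}{k}$ in the non-strict inequality and letting $k\to\infty$ gives
$$\alpha \;\geq\; \lim_{k\to\infty}\left(1-\tfrac{1}{k}\right)\int_X s\,d\mu \;=\; \int_X s\,d\mu.$$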
Question 3.
From $(6)$ to $(7)$ it is simply a matter of applying the definition of the Lebesgue integral as a supremum. We defined $\alpha$ at the beginning, and what $(6)$ shows is that for every simple $0\leq s\leq f$, we have $\int_Xs\,d\mu\leq \alpha$. In other words, the set of numbers $\left\{\int_Xs\,d\mu\,:\, \, \text{$0\leq s\leq f$ is simple}\right\}$ has $\alpha$ as an upper bound. Thus, by definition of the supremum, we have that
\begin{align}
\sup\left\{\int_Xs\,d\mu\,:\, \, \text{$0\leq s\leq f$ is simple}\right\} \leq \alpha.
\end{align}
But the LHS is none other than $\int_Xf\,d\mu$, by definition.
Question 4.
The inequality $\lim\limits_{n\to \infty}\int_Xf_n\,d\mu =\alpha \leq \int_Xf\,d\mu$ is a trivial consequence of monotonicity of the integral. The other direction is not so trivial, because the definition of $\int_Xf\,d\mu$ involves a huge supremum over all simple functions. Also, up to this point in the treatment, the only things we really know about integrals are a few basic facts (theorem 1.24), and the only integrals we can explicitly calculate are those of simple functions. The idea is therefore to somehow reduce the more complicated problem of proving the reverse inequality to something more familiar.
By the definition of the Lebesgue integral as a supremum, showing that $\int_Xf\,d\mu\leq \alpha$ is equivalent to showing that for every simple $0\leq s\leq f = \lim f_n$, we have $\int_Xs\,d\mu \leq \alpha$. It would be nice if we could say something like "because $s$ is smaller than $f$ and $f_n$ increases to $f$, for large $n$ we have $s\leq f_n$". If we could say this, then we would have $\int_Xs\,d\mu \leq \int_Xf_n\,d\mu$ for all large $n$, and hence, by taking limits, $\int_Xs\,d\mu \leq \alpha$, completing the proof.
Unfortunately, this isn't quite right: we CANNOT deduce that $s\leq f_n$ for large $n$ (e.g. if $f_n \equiv 1-\tfrac{1}{n}$ and $s = f \equiv 1$, then $s\leq f_n$ fails for every $n$). So what can we do? We give ourselves some room for error by introducing the scaling factor $0<c<1$. Then $cs$ is a scaled-down version of $s$, and for it we can obtain good control: $cs\leq f_n$ on $E_n$, and $\bigcup E_n = X$. By Rudin's argument we thus deduce $c\int_Xs\,d\mu \leq \alpha$, and finally we let $c\to 1^-$.
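In symbols (a sketch of the chain in Rudin's proof, with $E_n:=\{x\,:\,f_n(x)\geq c\,s(x)\}$): the $E_n$ increase and $\bigcup E_n = X$, because if $s(x)=0$ then $x\in E_1$, while if $s(x)>0$ then $cs(x)<s(x)\leq f(x)=\lim f_n(x)$, so $x\in E_n$ eventually. Then
$$\int_X f_n\,d\mu \;\geq\; \int_{E_n} f_n\,d\mu \;\geq\; c\int_{E_n} s\,d\mu \;\xrightarrow[n\to\infty]{}\; c\int_X s\,d\mu,$$
where the last step uses continuity from below applied to the simple function $s$; letting $n\to\infty$ on the left gives $\alpha \geq c\int_X s\,d\mu$.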
This final idea of "giving yourself some room for error" is a VERY common idea in analysis. If you've studied baby Rudin (or Spivak, or any other introductory book), you have surely seen such ideas. The simplest example of this I can think of is that, given $z\in \Bbb{C}$, we have $z=0$ if and only if $|z|\leq \epsilon$ for every $\epsilon>0$. These are equivalent statements, but sometimes the second is easier to prove, because you have an $\epsilon$ amount of wiggle room in establishing the inequality $|z|\leq \epsilon$.
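For completeness, the easy check for that example: the forward direction is trivial, and conversely, if $z\neq 0$ then $\epsilon:=|z|/2>0$ satisfies $|z|>\epsilon$, contradicting the assumption; so $|z|\leq\epsilon$ for all $\epsilon>0$ forces $z=0$.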
Best Answer
Update: I have Rudin's Principles of Mathematical Analysis (PMA) and Real and Complex Analysis (RCA) in front of me now, and I have to agree with you: there is no need to use a simple function $s$ instead of $f$ itself in PMA's proof of the monotone convergence theorem, since the heavy lifting with simple functions has already been done in theorem 11.24. But the interesting part is comparing this to RCA. There the proof is identical, and the theorems corresponding to PMA's 11.3 and 11.24 are 1.19(d) and 1.25, respectively, with one big difference: 1.25 is stated only for nonnegative measurable simple functions, whereas PMA's 11.24 was stated for nonnegative measurable functions. Therefore RCA's proof has to use simple functions.
So, unless someone else can spot why one would still need to use simple functions in PMA's proof, I'm assuming that Professor Rudin simply used the same proof in both books and for some reason didn't streamline it for PMA. The proof is still correct, of course.
I have the third edition of both books and it would be interesting to know if the proof is the same in earlier editions. After all, according to Wikipedia, Rudin wrote PMA first, so the proof in RCA couldn't have influenced his decisions for the first edition. In any case, a very nice observation by you.
I'm leaving my original answer below so that your comment still makes sense and hopefully it could still serve as food for thought for some.
If you were to replace $s$ by $f$, you would be using circular reasoning. The line $$\int_E f_n \, \mathrm{d}\mu \geq \int_{E_n} f_n \, \mathrm{d}\mu \geq c\int_{E_n}s \, \mathrm{d}\mu = c\int_E s1_{E_n} \, \mathrm{d}\mu,$$ where $1_{E_n}$ is the indicator (or characteristic) function, would become $$\int_E f_n \, \mathrm{d}\mu \geq \int_{E_n} f_n \, \mathrm{d}\mu \geq c\int_{E_n}f \, \mathrm{d}\mu = c\int_E f1_{E_n}\, \mathrm{d}\mu.$$ But we can't yet conclude $$\lim_{n \to \infty} c\int_E f1_{E_n}\, \mathrm{d}\mu = c\int_E f \, \mathrm{d}\mu,$$ because $f1_{E_n}$ is not a simple function, and we are in the process of proving the very monotone convergence theorem that would justify this conclusion. We have to use simple functions, since for them this convergence is true by the definition of the integral (see the sketch below).
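To spell that out (a short sketch using only countable additivity, which is available before MCT is proved): write the simple function as $s = \sum_{i=1}^k \alpha_i 1_{A_i}$ with $\alpha_i \in [0,\infty)$ and the $A_i$ measurable. Since $E_n \uparrow X$, continuity of $\mu$ from below gives $$\int_E s1_{E_n}\, \mathrm{d}\mu = \sum_{i=1}^k \alpha_i\,\mu(A_i \cap E \cap E_n) \xrightarrow[n\to\infty]{} \sum_{i=1}^k \alpha_i\,\mu(A_i \cap E) = \int_E s\, \mathrm{d}\mu.$$ (In RCA this is packaged as the statement that $E\mapsto\int_E s\,\mathrm{d}\mu$ is a measure.)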