I think $(3)$ is misleading. Since $E_k$ is the set of all sample points in exactly $k$ of the $A_i$ sets, $(3)$ makes no sense at all to me.
For $(3)$ I suggest we let each $E_k$ be a disjoint union:
$$E_k = \bigcup_{i=1}^{\binom{n}{k}} D_{k,i}$$
where each $D_{k,i}$ is the subset of $E_k$ where the $k$ $A_i$'s are a distinct $k$ of the sets $A_1,\ldots,A_n$.
Then, for $(3),\;$ we assume WLOG, for any given $k,i,\;$ that $D_{k,i}$ is contained in $A_1,\ldots,A_k$ (and in no others) and proceed as I think the author intends, with $D_{k,i}$ instead of $E_k$. Note that, by the additivity of probability:
$$P(E_k) = \sum_{i=1}^{\binom{n}{k}} P(D_{k,i}).$$
The idea of $(5)$ is that we have
\begin{eqnarray*}
P\left(\bigcup_{i=1}^{n}A_i\right) &=& \sum_{k=1}^{n} P(E_k) \qquad\qquad\text{by $(1)$} \\
&=& \sum_{k=1}^{n} \sum_{i=1}^{\binom{n}{k}} P(D_{k,i}) \qquad\text{by $(3)$}\qquad\qquad\qquad\text{(*)} \\
\end{eqnarray*}
By $(3),\;$ any $P(D_{k,i})$ appears in the expression $P_1-P_2+P_3-\cdots\pm P_k,\;$ exactly $\left(k-\binom{k}{2}+\binom{k}{3}-\cdots\pm\binom{k}{k}\right)$ times, and this equals $1$ by $(4)$.
So, from $(*),\;$ we can conclude that $P(\cup_{i=1}^{n}A_i) = P_1-P_2+P_3-\cdots\pm P_n,\;$ as required.
One essential aspect of the Inclusion-Exclusion Principle (IEP) is the transformation of at least information to exact information.
If counting objects having at least a number of properties is simple, but counting objects having exactly a number of properties is difficult, than the IEP comes into play.
The objects are represented by the elements contained in $A_1,\dots,A_n$ and the properties of an element $x$ are the sets $A_j,1\leq j\leq n$, which contain $x$.
If we have this essence of IEP in mind and we look at the expression:
\begin{align*}
\left|\bigcup_{j=1}^{n}A_j\right|=\sum_{j=1}^{n}\left|A_j\right|
-\sum_{1\leq i \leq j \leq n}\left|A_i\cap A_j\right|\pm\cdots+(-1)^{n-1}\left|\bigcap_{j=1}^{n}A_j\right|
\end{align*}
we observe, that the right hand side (RHS) consists of summands with at least information.
Note, that the summand
$$\left|A_i\cap A_j\right|\quad \text{in}\quad\sum_{1\leq i \leq j \leq n}\left|A_i\cap A_j\right|$$
is not only the number of elements which are in $A_i$ and $A_j$ it is more precisely the number of elements which are at least in $A_i$ and $A_j$, since elements $x$ in $A_i\cap A_j$ may also be contained in other sets of $A_1,\dots,A_n$.
Whereas the LHS
$$\left|\bigcup_{j=1}^{n}A_j\right|$$
presents the number of elements which are exactly in $\bigcup_{j=1}^{n}A_j$.
We observe the IEP transforms counting information with at least properties into counting information with exact properties.
Note: This intuitive connection between at least and exact information has a formal represention. Using generating functions you will get a kind of eye-birds view at hand, which transforms the at least to an exact information as simple shift by one of the argument. For more information about this approach you may have a look at section 4.2 of H.S. Wilfs Generatingfunctionology.
Best Answer
The explanation goes back to the origins of Boolean algebra. Define a "Boolean" variable $\, x \,$ as one where $\, x^2 = x. \,$ For these variables $\, 1-x \,$ denotes logical negation and $\, xy \,$ denotes logical and. Finally, by De Morgan's laws, $\ 1-(1-x)(1-y) \,$ denotes logical or.
In the context of subsets of a univeral set $\, U, \,$ any subset $\, A \,$ of $\, U \,$ is identified by its indicator function $\, A(x) \,$ which is Boolean valued. Now the number of elements of $\, A \,$ is $\, |A| = \sum A(x) .\,$ Next, the indicator function of $\, A \cap B \,$ is $\, A(x)B(x), \,$ of complement of $\, A \,$ is $\, 1-A(x), \,$ and of $\, A \cup B \,$ is $\, 1-(1-A(x))(1-B(x)). \,$
Given any subsets $\, A_1, A_2, \dots, A_n \,$ and their indicator functions $\, A_1(x), A_2(x), \dots, A_n(x), \,$ we can write $\, 1-(1-A_1(x))(1-A_2(x)) = A_1(x) + A_2(x) - A_1(x)A_2(x) \,$ which implies $\, |A_1 \cap A_2| = |A_1| + |A_2| - |A_1 \cup A_2|. \,$ This obviously generalizes to any finite number of subsets and is one form of the PIE.
Now suppose that $\,A = A_1=A_2=\dots=A_n.\,$ Use the binomial theorem to get $$ (1-A(x))^n = \sum_{k=0}^n {n\choose{k}} (-1)^k A(x)^{k}$$ but since $\,A(x)\,$ and $\,1-A(x)\,$ are Boolean valued and if also $\,n>0\,$ this simplifies to $$ 1-A(x) = (1-A(x))^n = 1 + A(x)\sum_{k=1}^n {n\choose{k}} (-1)^k.$$ If now $\,A(x)\ne 0\,$ then this implies that $$ -1 = \sum_{k=1}^n {n\choose{k}} (-1)^k \qquad \text{ and } \qquad 0 = \sum_{k=0}^n {n\choose{k}} (-1)^k . $$