Hamilton shows in the book that this is a correct representation, but the approach may seem a bit counterintuitive. Let me therefore first give a high-level answer that motivates his modeling choice and then elaborate on his derivation.
Motivation:
As should become clear from reading Chapter 13, there are many ways to write a dynamic model in state-space form. We should therefore ask why Hamilton chose this particular representation. The reason is that this representation keeps the dimensionality of the state vector low. Intuitively, you would think (or at least I would) that the state vector for an ARMA($p$,$q$) needs to be of dimension at least $p+q$. After all, just from observing, say, $y_{t-1}$, we cannot infer the value of $\epsilon_{t-1}$. Yet he shows that we can define the state-space representation in a clever way that leaves the state vector of dimension at most $r = \max\{p, q + 1 \}$. Keeping the state dimension low is presumably important for the computational implementation. It turns out that his state-space representation also offers a nice interpretation of an ARMA process: the unobserved state follows an AR process, while the MA($q$) part arises because the observation is a moving average of current and lagged states.
Derivation:
Now for the derivation. First note that, using lag operator notation, the ARMA(p,q) is defined as:
$$
(1-\phi_1L - \ldots - \phi_rL^r)(y_t - \mu) =(1 + \theta_1L + \ldots + \theta_{r-1}L^{r-1})\epsilon_t
$$
where we let $\phi_j = 0$ for $j>p$ and $\theta_j = 0$ for $j>q$, and we omit $\theta_r$ since $r \geq q+1$. So all we need to show is that his state and observation equations imply the equation above. Let the state vector be
$$
\boldsymbol{\xi}_t = (\xi_{1,t}, \xi_{2,t},\ldots,\xi_{r,t})^\top
$$
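The zero-padding convention above can be sketched in a few lines of code (the function and variable names here are mine, not Hamilton's):

```python
import numpy as np

def pad_to_r(phi, theta):
    """Embed ARMA(p, q) coefficients in the r-dimensional representation,
    with phi_j = 0 for j > p, theta_j = 0 for j > q, and r = max(p, q + 1)."""
    p, q = len(phi), len(theta)
    r = max(p, q + 1)
    phi_r = np.zeros(r)
    phi_r[:p] = phi
    theta_r = np.zeros(r - 1)   # theta_1, ..., theta_{r-1}; theta_r never appears
    theta_r[:q] = theta
    return phi_r, theta_r
```

For example, an ARMA(1,2) has $r = \max\{1, 3\} = 3$, so `pad_to_r([0.5], [0.3, 0.1])` pads the AR side to `[0.5, 0, 0]` and keeps `[0.3, 0.1]` for the MA side.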
Now look at the state equation. You can check that equations $2$ to $r$ simply shift the state: $\xi_{i,t+1} = \xi_{i-1,t}$ for $i = 2, \ldots, r$, so each entry moves down one slot per period and $\xi_{r,t}$ is discarded from the state vector at $t+1$. The first equation, defining $\xi_{1,t+1}$, is therefore the relevant one. Writing it out:
$$
\xi_{1,t+1} = \phi_1 \xi_{1,t} + \phi_2 \xi_{2,t} + \ldots + \phi_r \xi_{r,t} + \epsilon_{t+1}
$$
Since the second element of $\boldsymbol{\xi}_{t}$ is the first element of $\boldsymbol{\xi}_{t-1}$, the third element of $\boldsymbol{\xi}_{t}$ is the first element of $\boldsymbol{\xi}_{t-2}$, and so on, we can rewrite this using lag-operator notation and moving the lag polynomial to the left-hand side (equation 13.1.24 in H.):
$$
(1-\phi_1L - \ldots - \phi_rL^r)\xi_{1,t+1} = \epsilon_{t+1}
$$
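The transition just described can be collected into a companion matrix: the first row carries $\phi_1, \ldots, \phi_r$ and the ones on the subdiagonal do the shifting. A minimal sketch (the function name is mine):

```python
import numpy as np

def hamilton_F(phi_r):
    """Transition matrix of Hamilton's state equation: the first row holds
    phi_1, ..., phi_r; the subdiagonal ones shift xi_{i-1,t} into xi_{i,t+1};
    the last state entry is discarded."""
    r = len(phi_r)
    F = np.zeros((r, r))
    F[0, :] = phi_r
    F[1:, :-1] = np.eye(r - 1)
    return F
```

For instance, with $\phi_1 = 0.5$ and $\phi_2 = 0.2$, `hamilton_F([0.5, 0.2])` returns `[[0.5, 0.2], [1, 0]]`.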
So the hidden state follows an autoregressive process. Similarly, the observation equation is
$$
y_t = \mu + \xi_{1,t} + \theta_1\xi_{2,t} + \ldots + \theta_{r-1}\xi_{r-1,t}
$$
or
$$
y_t - \mu = (1 + \theta_1L + \ldots + \theta_{r-1}L^{r-1})\xi_{1,t}
$$
This does not look much like an ARMA so far, but now comes the nice part: multiply the last equation by $(1-\phi_1L - \ldots - \phi_rL^r)$:
$$
(1-\phi_1L - \ldots - \phi_rL^r)(y_t - \mu) = (1 + \theta_1L + \ldots + \theta_{r-1}L^{r-1})(1-\phi_1L - \ldots - \phi_rL^r)\xi_{1,t}
$$
But from the state equation (lagged by one period), we have $(1-\phi_1L - \ldots - \phi_rL^r)\xi_{1,t} = \epsilon_{t}$! So the above is equivalent to
$$
(1-\phi_1L - \ldots - \phi_rL^r)(y_t - \mu) = (1 + \theta_1L + \ldots + \theta_{r-1}L^{r-1})\epsilon_{t}
$$
which is exactly what we needed to show! So the state-observation system correctly represents the ARMA(p,q). I was really just paraphrasing Hamilton, but I hope that this is useful anyway.
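As a numerical sanity check (not in Hamilton; the parameter values are arbitrary), one can simulate the state-space system and the direct ARMA recursion with the same shocks and zero pre-sample values, and confirm that the two paths coincide:

```python
import numpy as np

rng = np.random.default_rng(0)
phi, theta, mu = [0.5, -0.2], [0.4], 1.0   # an arbitrary ARMA(2,1) for illustration
p, q = len(phi), len(theta)
r = max(p, q + 1)                          # r = 2 here

phi_r = np.zeros(r); phi_r[:p] = phi
theta_r = np.zeros(r - 1); theta_r[:q] = theta

# Hamilton's form: xi_{t+1} = F xi_t + (eps_{t+1}, 0, ..., 0)',
#                  y_t = mu + (1, theta_1, ..., theta_{r-1}) xi_t
F = np.zeros((r, r)); F[0, :] = phi_r; F[1:, :-1] = np.eye(r - 1)
H = np.concatenate(([1.0], theta_r))

T = 200
eps = rng.standard_normal(T)
xi = np.zeros(r)                           # zero pre-sample state
y_ss = np.empty(T)
for t in range(T):
    xi = F @ xi                            # shift and apply AR coefficients
    xi[0] += eps[t]                        # new shock enters the first state
    y_ss[t] = mu + H @ xi                  # observation equation

# Direct ARMA recursion on the demeaned series, same shocks, zero pre-sample
y_arma = np.zeros(T)
for t in range(T):
    ar = sum(phi[j] * y_arma[t - 1 - j] for j in range(p) if t - 1 - j >= 0)
    ma = sum(theta[j] * eps[t - 1 - j] for j in range(q) if t - 1 - j >= 0)
    y_arma[t] = ar + ma + eps[t]

assert np.allclose(y_ss, mu + y_arma)      # the two paths coincide
```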
Best Answer
Provided that $|\phi| < 1$, we can, as in the OP, define $X_t := \sum_{j = 0}^\infty \phi^j Z_{t -j}$, so that $(1-\phi B) \,X_t = Z_t$. Let us temporarily denote by $Y^\star_t$ the observation of the candidate State-Space (SS) representation. Since $Y_t^{\star} := \theta X_{t-1} + X_{t}$, applying $(1 - \phi B)$ to each side of this relation gives relation (1) with $Y^\star_t$ replacing $Y_t$. So the same model holds for $Y^\star_t$ and $Y_t$, and the SS representation with $\mathbf{X}_t := [X_{t-1}, \, X_t]'$ gives the desired model for $Y_t$.
The series $X_t$ is obtained by applying the linear filter $1/(1 - \phi B)$ to the white noise $Z_t$. We can describe $X_t$ as a "coloured noise", yet I cannot see a better interpretation to answer question 2. To get an interpretable (equivalent) representation, we could go with the state vector $\mathbf{V}_t = [Y_{t},\, \hat{Y}_{t+1|t}]'$ and the observation equation $Y_t = [1,\, 0] \mathbf{V}_t$. This generalises to the $\text{ARMA}(p, \,q)$ case with $r := \max\{p,\,q +1\}$ by taking $\mathbf{V}_t := [Y_{t},\, \hat{Y}_{t+1|t}, \,\dots,\,\hat{Y}_{t+r-1|t}]'$ with the obvious observation equation $Y_t = [1,\, 0, \, \dots,\,0]\mathbf{V}_t$. This is sometimes called the Akaike SS representation.
Note also that in a SS representation we need an initial covariance matrix, be it that of $\mathbf{X}_0$ or that of $\mathbf{X}_1$, in both cases conditional on the empty observation set preceding the observation of $Y_1$. For the SS representation of the $\text{ARMA}(1,\,1)$ above, this covariance matrix is easily derived.
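For the ARMA(1,1) representation above, the stationary covariance of $\mathbf{X}_t = [X_{t-1}, \, X_t]'$ solves the discrete Lyapunov equation $P = FPF' + Q$. A sketch using `scipy.linalg.solve_discrete_lyapunov` (the parameter values are illustrative), checked against the closed form $P = \frac{\sigma^2}{1-\phi^2}\begin{bmatrix}1 & \phi\\ \phi & 1\end{bmatrix}$ implied by the AR(1) for $X_t$:

```python
import numpy as np
from scipy.linalg import solve_discrete_lyapunov

phi, sigma2 = 0.7, 1.0                       # illustrative values, |phi| < 1
F = np.array([[0.0, 1.0], [0.0, phi]])       # transition for state [X_{t-1}, X_t]'
Q = sigma2 * np.array([[0.0, 0.0], [0.0, 1.0]])  # noise enters only via X_{t+1}

# Stationary (unconditional) state covariance: P = F P F' + Q
P = solve_discrete_lyapunov(F, Q)

# Closed form from Var(X_t) = sigma2/(1-phi^2), Cov(X_{t-1}, X_t) = phi Var(X_t)
P_exact = sigma2 / (1 - phi**2) * np.array([[1, phi], [phi, 1]])
assert np.allclose(P, P_exact)
```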