Why not apply the general argument, valid for every subcritical or critical branching process?
Conditionally on $Z_1=k$, the probability of extinction is $\theta^k$, where $\theta$ is the probability that the whole process, starting from $Z_0=1$, goes extinct. Thus, $$\theta=\sum_kP(Z_1=k)\theta^k=G(\theta),\qquad G(s)=E(s^{Z_1}),$$ where $G$ denotes the generating function of the number of offspring of each individual. If $E(Z_1)\leqslant1$ (and $P(Z_1=1)\ne1$), the function $G$ is convex with $G'(1)=E(Z_1)\leqslant1$ and $G'(s)\lt1$ for every $s$ in $[0,1)$ (either $G'$ is strictly increasing, or $G'\equiv P(Z_1=1)\lt1$), hence $G(t)-t=\int_t^1(1-G'(s))\,\mathrm ds\gt0$ for every $t$ in $[0,1)$; in particular, the only solution of $\theta=G(\theta)$ in $[0,1]$ is $\theta=1$.
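For concreteness, here is a minimal numerical sketch of this fixed-point argument (the Poisson offspring law is an illustrative choice, not part of the question): since $G_n(0)=P(Z_n=0)$ increases to $\theta$, iterating $\theta\mapsto G(\theta)$ from $0$ converges to the smallest fixed point of $G$ in $[0,1]$.

```python
import math

def extinction_probability(G, tol=1e-12, max_iter=200_000):
    """Smallest fixed point of the offspring pgf G on [0, 1]:
    theta_n = G_n(0) = P(Z_n = 0) increases to theta, so iterate G from 0."""
    theta = 0.0
    for _ in range(max_iter):
        theta_next = G(theta)
        if abs(theta_next - theta) < tol:
            return theta_next
        theta = theta_next
    return theta  # the critical case converges slowly, so this may be a lower bound

# Illustrative Poisson(lam) offspring: G(s) = exp(lam*(s - 1)), E(Z_1) = lam
for lam in (0.8, 1.0, 1.5):
    G = lambda s, lam=lam: math.exp(lam * (s - 1.0))
    print(lam, extinction_probability(G))
# lam = 0.8 and lam = 1.0 give theta = 1 (the critical case approaches 1 slowly),
# while lam = 1.5 gives theta ≈ 0.417 < 1.
```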
Source: Any textbook on branching processes (which one are you following?).
Note: The argument in your question, based on expectations, works when $\lambda\lt1$ but needs to be made more precise. Indeed, $E(Z_n)\to0$ and, because $Z_n$ is nonnegative and integer valued, $Z_n\geqslant\mathbf 1_{Z_n\ne0}$ almost surely, hence $P(Z_n\ne0)\leqslant E(Z_n)$ and therefore $P(Z_n\ne0)\to0$. Note that this argument would fail in general for real-valued random variables.
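As a small Monte Carlo sketch of this bound (again with an illustrative Poisson offspring law, not taken from the question), one can compare the empirical survival probability at generation $n$ with $E(Z_n)=\lambda^n$:

```python
import numpy as np

rng = np.random.default_rng(0)
lam, n_gen, n_paths = 0.8, 10, 20_000   # illustrative subcritical parameters

def survives(n_gen, lam):
    """Run one Galton-Watson path with Poisson(lam) offspring from Z_0 = 1
    and report whether Z_{n_gen} > 0."""
    z = 1
    for _ in range(n_gen):
        if z == 0:
            return False
        z = rng.poisson(lam, size=z).sum()
    return z > 0

p_survive = sum(survives(n_gen, lam) for _ in range(n_paths)) / n_paths
print(p_survive, "<=", lam ** n_gen)
# Typically about 0.04, comfortably below the bound E(Z_10) = 0.8**10 ≈ 0.107.
```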
As you noticed, the characteristic, moment generating and probability generating functions are related to one another, in that they represent different facets of the same object. Indeed, they correspond, respectively, to the Fourier, Laplace and Mellin transforms of the probability mass/density function, which are themselves related through changes of variable.
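For an integer-valued random variable these substitutions are explicit: $M_X(t)=E(e^{tX})=G_X(e^t)$ and $\varphi_X(t)=E(e^{itX})=G_X(e^{it})$. A short symbolic sketch, using a Poisson variable purely as an illustration:

```python
import sympy as sp

s, t, lam = sp.symbols('s t lambda')

G = sp.exp(lam * (s - 1))              # pgf of a Poisson(lambda) variable
M = G.subs(s, sp.exp(t))               # mgf:  M(t)   = G(e^t)
phi = G.subs(s, sp.exp(sp.I * t))      # cf:   phi(t) = G(e^{it})
print(M)    # exp(lambda*(exp(t) - 1))
print(phi)  # exp(lambda*(exp(I*t) - 1))
```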
As always in mathematics, they are tools at your disposal, to be taken advantage of whenever the context calls for them. It is worth highlighting that each of them uniquely determines the distribution, so they serve as a criterion for distinguishing/recognizing the distribution behind a random variable.
Nonetheless, their main purpose is very down-to-earth in the end: they simplify certain computations. For example, you have most certainly already computed moments of a random variable; the computations are manageable for the mean or the variance, but they become more involved for higher moments. The probability generating function provides a quicker way to compute them through differentiation, because, in a way, all the moments are encoded once and for all inside that function (its successive derivatives at $1$ are the factorial moments, from which the ordinary moments follow).
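As a quick sketch of this (the Poisson choice is purely illustrative), differentiating the pgf at $1$ gives the factorial moments, from which the mean and variance follow:

```python
import sympy as sp

s, lam = sp.symbols('s lambda', positive=True)
G = sp.exp(lam * (s - 1))                        # pgf of a Poisson(lambda) variable

mean = sp.diff(G, s).subs(s, 1)                  # G'(1)  = E[X]
fact2 = sp.diff(G, s, 2).subs(s, 1)              # G''(1) = E[X(X-1)]
variance = sp.simplify(fact2 + mean - mean**2)   # Var X  = G''(1) + G'(1) - G'(1)^2

print(mean, variance)                            # both equal lambda, as expected
```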
Another well-known example is given by sums and linear combinations of independent random variables. In that case, the probability mass/density function is the ($n$-fold) convolution product of the individual probability functions, which may turn out to be quite hard to handle. However, thanks to the convolution theorems associated with the Fourier/Laplace/Mellin transforms, the combination is described simply by the usual product of the characteristic / moment generating / probability generating functions respectively. For example, the probability generating function of $X = \lambda_1X_1 + \ldots + \lambda_nX_n$, with $X_1,\ldots,X_n$ independent, is given by $G_X(z) = G_{X_1}\left(z^{\lambda_1}\right) \cdots G_{X_n}\left(z^{\lambda_n}\right)$. When the number of summands is itself a random variable $N$ (cf. compound processes) and the $X_i \sim X$ are iid and independent of $N$, we even have $G_{X_1+\ldots+X_N}(z) = G_N(G_X(z))$.
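A short symbolic check of the compound identity $G_{X_1+\ldots+X_N}(z)=G_N(G_X(z))$, for the illustrative (not from the question) choice $N\sim\operatorname{Poisson}(\lambda)$ and $X_i\sim\operatorname{Bernoulli}(p)$, recovers the classical thinning result that the random sum is $\operatorname{Poisson}(\lambda p)$:

```python
import sympy as sp

z, lam, p = sp.symbols('z lambda p', positive=True)

G_N = sp.exp(lam * (z - 1))      # pgf of N ~ Poisson(lambda)
G_X = 1 - p + p * z              # pgf of X ~ Bernoulli(p)

G_sum = sp.simplify(G_N.subs(z, G_X))    # G_N(G_X(z))
print(G_sum)   # a form equivalent to exp(lambda*p*(z - 1)), the Poisson(lambda*p) pgf
```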
Finally, it should be noted that some distributions involve such unwieldy expressions that they do not even admit a closed-form probability mass/density function (most stable distributions, for instance); consequently, the characteristic/generating function is the only convenient way to describe them.
Best Answer
Using induction on $n$, you easily get $G_n(s)=1-\alpha^{(1-\beta^n)/(1-\beta)}(1-s)^{\beta^n}$ (reading $1-\alpha^n(1-s)$ for $\beta=1$). As for the expected values, you've got a correct result; these don't exist (i.e. are $\infty$) if $\beta<1$.
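If the offspring pgf is $G(s)=1-\alpha(1-s)^{\beta}$, as the $n=1$ case of the formula suggests, the induction step is simply $G_{n+1}(s)=G(G_n(s))=1-\alpha\bigl(1-G_n(s)\bigr)^{\beta}$, using $1+\beta\frac{1-\beta^n}{1-\beta}=\frac{1-\beta^{n+1}}{1-\beta}$. A quick numerical sketch with illustrative parameter values confirms the closed form:

```python
alpha, beta = 0.6, 0.5                 # illustrative values with 0 < alpha, beta < 1

def G(s):
    """Offspring pgf implied by the n = 1 case: G(s) = 1 - alpha*(1 - s)**beta."""
    return 1 - alpha * (1 - s)**beta

def G_closed(n, s):
    """Closed form: G_n(s) = 1 - alpha**((1 - beta**n)/(1 - beta)) * (1 - s)**(beta**n)."""
    return 1 - alpha**((1 - beta**n) / (1 - beta)) * (1 - s)**(beta**n)

for n in (1, 2, 3, 5):
    for s in (0.0, 0.25, 0.5, 0.9):
        g = s
        for _ in range(n):
            g = G(g)                   # n-fold iterate G_n(s) = G(G(...G(s)...))
        assert abs(g - G_closed(n, s)) < 1e-12
print("closed form matches the iterated pgf")
```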